Monday, October 19, 2009

Data structures fundamentals

This is an introductory article, tagged under basic. Please do not feel offended if these arguments are obvious to you, as there was a time when you and I did not know anything about software.

Algorithms and data structures is a fundamental course in computer engineering. This post will explain the basic data structures that many programmers encounter during their daily work, provided by libraries or programming languages or by userland implementations. If you had ever heard of stacks, lists and trees, you already came in contact with data structures.

What is a data structure?
A data structure is a logic model to organize information. The reason for this organization are primarily of efficiency. However, while the physical model of data and metadata is tipically sequential and indexed (think of a computer's memory), the logical model assumes very different forms.
In the object-oriented paradigm, this distinction is typically achieved by encapsulating the physical model using class private members and abstracting the logical model in an interface this class implements. Thus, it is not only the physical model of data that is kept in a specific implementation, but also the algorithms instances that perform the operations prescribed by the abstraction. The physical data structures and their algorithms are the subjects taught in basic courses in college.
Back in the C++ days, there were no interfaces but only basic classes with pure virtual methods (abstract data types), which the chosen implementation inherited from. Though, there is no real difference in the physical data structures used as the lower layer is the same, a sequence of blocks in Ram or disk memory. Lowering further the level is usually necessary to exploit the performance of a medium (for instance in data structures for backup tapes).
A note on implementations: every abstraction leaks some information of the underlying data structure and data types are no exception. In this case the computational complexity of the various implemented methods is visible under the cover of an interface. The classic example is the comparison of an array and a list: while the first offers O(n) insertion and O(1) direct access, the second exactly inverts the complexities of these operations.
In some cases, the abstraction is not even present and the methods reflect the internal organization of the structure.

List of common data structures
  • Array/List. Defined as linear data structures, which can be maintaned in order in respect to a field of the elements contained or arbitrarily. The difference is in the internal organization, which is in a single block for an array and in small, linked blocks of memory for a list. Examples: the classical java.util.ArrayList, php numeric array.
  • Stack. A structure where elements are piled up every time there is an insertion and only the topmost one can be removed. Examples: java.util.Stack, SplStack in php
  • Queue. Specular to the stack, a linear structure where elements are inserted at one end and extracted at the other, conserving the insertion order. Examples: java.util.LinkedList in Java (which implements the java.util.Queue interface), SplQueue in php.
  • Tree. Hierarchical structure where elements can have a certain number of children and one parent. Trees can be used as-is or to implement a more complicated data structure such as a dictionary; when they are used by a high-order data structure they are usually optimized in one of the different hundreds of existing trees types. Examples of standalone trees: javax.swing.tree.TreeModel, a nested set abstraction layer like Doctrine_Tree (used in conjunction with a relational database).
  • Priority queue. A modified queue where elements are extracted according to their priority and not while respecting the original order. For instance, a heap is a tree used to implement a priority queue.
  • Dictionary/Map. Random-access structures that can give back an element by its specified key, often in O(1); this computational complexity is equivalent to a retrieval time independent from the number of elements stored and maps are much better than linear structures like lists from this point of view. Sometimes the border between maps and hashmaps is subtle as the latters are used to implemente the formers, but there are other physical models for dictionaries/maps, for instance binary or N-ary trees. Examples: java.util.HashMap, java.util.TreeMap, php associative array.
  • Graph. A superset of a tree where the requirement for elements to have only one parent is removed. Elements present a generic number of association to 1, 2, ..., N other ones. Example: domain objects are viewed from an Orm as a generic graph to persist in a database.
These structures can be combined recursively thanks to the abstraction tools provided by programming languages, such as templates and polymorphism. You can have a list of graphs in your application, or a tree of stacks (where the basic node is a stack). The ones listed here are the basic blocks used to construct more complex structures. Programming languages like Java provide also more sophisticated abstractions such as Sets and Bags (as an Hibernate mapping).

Do you think I have left out any fundamental data structure in this list? Do you feel there is enough support in the Php and Java world for these abstractions?


Unknown said...

Frankly, data structures in PHP are a pain point to me. I don't like how the php "array" is supposed to be a list and a map at the same time and as a map its even retricted to numeric or string keys. I miss a sophisticated collections API in the SPL, not only for specialized data structures as it is now, but for the basic ones, too like lists, maps and sets. Something similar to the Java, .NET or Scala collections with proper interfaces. You could still use a php "array" whenever its convenient but then you could also alternatively use an object-oriented and more sophisticated API of a certain data structure if you wanted to. SplCollection, SplSet, SplList, SplMap, ... anyone?

Unknown said...

Just look at how we have to invent collection interfaces for ORM solutions ... ;) it just sucks.

Giorgio said...

Yes Roman, php is slowly introducing object-oriented features but Spl is still small and other extensions are not a standard. Maybe after all the major frameworks have created collection classes an interface will eventually be extracted. :)