Wednesday, October 07, 2009

Modern database models

The word database is often heard during a developer's education and a decent database knowlege is considered fundamental. Nearly every serious application relies on a database to work and although nowadays the usual choice is a relational database, understanding other models helps to explore new ways of dealing with data and new, emergent products.

The database word indicates the collection of data saved somewhere in the infrastructure: this blog saves posts in a database, Firefox saves preferences in a little database in your filesystem. When we are talking of MySQL and Sql server, the correct term is Database Management System, or DBMS. A DBMS provides all the procedures and libraries needed to access and modify the database content, abstracting away the need to use low-level filesystem functions.
The abstraction a dbms builds upon its storage engine is called model. There is more than one model for representing data: some have become obsolete and other ones are so widely used that they probably will never disappear in the next twenty years no matter what happens in the database scenario.
Let's start talking about database model. The models I am presenting here are logical models, which specify how information is presented to the user or to the application which talks to the database. This model must be distinguished from the physical model, which consists in the way the dbms chooses to persist data on disks, tapes and other mass memory devices.

Flat model
When you open a spreadsheet or a simple ini file, you are using a flat model. Data are organized in one list or table, with similar elements or rows.
The problem with a flat database is representing the relationships between elements. For instance how do you associate two users that are friends or a user with the groups he chooses to belong to? Although this limitations, flat files and databases are useful because of their data access simplicity where there is the need for lightweight and easily implementable systems, such as config files or spreadsheet saved in the comma-separated values format.

Hierarchical model

The first improvement to the flat model, applied for the first time in the 1960s, is the addition of a pointer to every instance of data (which is called record). This pointer establish a child-parent relationship towards another record, where every one of them has at most one parent.
Information is thus presented in a tree structure, which is a good model for many real world entities. For instance, you can represent too much things in Xml, which is a hierarchical model too. The Dns system and Ldap protocol present a hierarchical model, but they are a specific application of this paradigm and not a dbms-like product.
A variation of the hierarchical model is named network model, which removes the limit of one parent and places a set of many pointers into every record. The tree-like model becomes a graph. This concept dates back to the 1965, so there's no buzz around network models and hierarchical general-purpose databases.

Relational model
The relational model is the widely used one which I was referring to at the start of this post. Relational dbms like MySQL, Microsoft SQL Server, PostgreSQL and even Sqlite constitute the majority of today's applications storage mechanisms.
The relational model describes data in various tables, where each row has a fixed set of fields that form the table's columns. Continuing with our example, a User and a Group tables, with their fields lists, are a relational model.
In this model, relationships between entities are established with the equality of some columns, often named primary keys and foreign keys. If you have not previously lived under a rock, you probably have used these databases a lot so I won't bore you anymore.

Object model
This is when the problem becomes interesting. Data that were presented in a relational model yesterday is being substituted by an object model, built with classes and instances. Maintaining the whole object graph in memory is usually too expensive to be feasible, since it requires enormous amount of resources like memory and cpu cycles to search objects in the mess of a 2-million-objects graph.
To preserve the object model and persists data at the same time, various solution have been proposed in the years:
  • Serialize the objects and put the binary stream on disk. Simple, but how do you search a User instance by his nick when he signs in?
  • Mapping the objects to a relational database, more or less saving every object as a row of a table which corresponds to the class of choice. An Object-Relational Mapper is the tool used for these operations, but it has some limits, for instance in dealing with class inheritance; these limits are known as the impedance mismatch. The Orm is also used for retrieval, ideally abstracting away the relational storage from the application.
  • Put the objects in an object database.
So, an object database is a tool built from scratch to persist an object model and to allow retrieval of small parts of the overall graph with procedures similar to the relational one (ordering, selection).
An object database would be great to use in real world, but currently going the Orm route is the standard since relational databases are at the world's center. Data typically don't fall from the sky, and there is a need for synchronization between applications and machines in a relational database. Thus, different object models can work on the same data and even with applications which don't use an object-oriented paradigm.

Document model
Instead of presenting a fixed structure, a dbms can instead show a semi-structured model, where records have no enforced lists of attributes. These type of entities are called documents, and an application or middleware which relies on it can store nearly everything as a document property: the advantage of this technique is that you'll never have to update a schema. An example of a document-oriented, open source database is CouchDB.

The schema-less novelty is one of the last buzzword in the database world, and it's still not clear what will the future of these solutions be. Relational databases are probably here to stay as there is a lock-in from applications all over the world to their data model. Object and document models are often presented as a panacea to improve scalability and simplicity, but they are not a standard at the moment. Try to explore new persistence solution, as the technology changing pace is slow in this field, but it exists.

In the image at the top, a typical relational model for an employees table, with the specification of primary key, fields and foreign keys.


romanb said...

Good post.

I want to add a bit to the topic of object databases. Many people (most of them did not yet even use an object database) think it is the solution to all their troubles with ORM solutions. However, this is not the case.
You pointed out one major problem already: accessibility of the data.

Furthermore I have unsurprisingly made the experience that object databases and ORMs are actually pretty similar in usage. What is completely removed is the need for specifying mapping information (that class to that table, that field to that column yada yada..) and the often cumbersome management of the second data model, the relational one.

But thats not the end of the story, object databases have the same problem as ORMs: the can usually not just fetch the whole database into memory. As a result, when using an object database, you need to deal with many of the same things and configuration issues as with an ORM: activation/lazy load, update depth/cascading, etc.
This does not go away.

Object databases absolutely have their strenghts and are a great fit for many applications, I think they especially shine on embedded systems but they're certainly not the ultimate solution to the object persistence "problem" (there just is no such ultimate solution, just many of them, each with their own strengths and weaknesses).

Giorgio said...

Thanks for the additions Roman. The php world, which I care a lot about, is slowly filling the gap with other languages like Java, both in object modelling and in object persistence.

Franco said...

Ironically, the "flat file" movement is gaining some traction these days under the "NoSQL" brand. There are a lot of (production) databases that are "row" oriented. See for more information.

Giorgio said...

I am happy that there are serious alternatives to the relational database. Of course all the applications that date back than 2000 probably run a rdbms - but we still have a choice for the new ones, since our job is also to find a tool that fits the requirements. Relational databases have the best transactions support, but the focus nowadays is shiting on scalability.