Wednesday, December 09, 2009

What everybody ought to know about storing objects in a database

Probably during your career you have heard the term impedance mismatch, which commonly refers to the problems that arise in converting data between different models (or between different cables, if you are into electrical engineering).
Usually the complete expression is object-relational impedance mismatch, which indicates the difficulties of the translation process between two versions of the same domain model: the former resides in memory and consists of an object graph, while the latter is used for storage and is a relational model kept in a database. The conversion between parts of the two models happens many times while an application runs, and in php's case at least once for every http request.

Object-relational mappers like Hibernate and Doctrine are infrastructure applications which deal with the mismatch, doing their best to implement a transparent mechanism and to provide the abstracted illusion of an in-memory model, as in the Repository pattern. These particular Orms are best of breed because they do not force your object graph to depend on infrastructure classes like base Active Records.
The connection between the two models is defined by the developer, by providing metadata about the classes' properties: for instance you can annotate a private field specifying the column type you want to use to store its value. But what are the translation rules the developer provides configuration for? Here is a basic set of the tasks an Orm performs for you (a code sketch follows the list).
  • Entity classes are translated to single tables as a general rule, with a one-to-one mapping. The private or public class fields which are configured for storage define the columns of the corresponding table.
  • Objects which you pass to the Orm for storage become rows of the corresponding table. A User class becomes a User table containing one row for every registered user of your website.
  • A primary key is defined by choosing among the existing fields, or by adding one ex novo. Often the Orm requires the developer to explicitly designate such a field.
  • Repeated single (or multiple) class fields become new tables, and the problem of representing them is shifted to representing relationships; in the domain model these objects are Value Objects, which are semantically different from Entities, but databases only care about homogeneous data and such objects receive no special treatment.
  • One-to-one and many-to-one relationships can be represented with a foreign key on the source entity that resembles the original pointer to a memory location.
  • One-to-many relationships are a bit trickier because they require a foreign key on what is called the owning side, in this case the target entity. What can seem strange at first glance is that even if the relationship is unidirectional in the domain (a pointer to a collection), the elements of the collection need a reference to the owner to unequivocally identify it. The mutual registration pattern can be used to build a correct Api starting from this constraint; I will write about it in a future post.
  • Many-to-many relationships are managed by creating an association table that references the participating entities with foreign keys. Every row constitutes a link between different objects; sometimes it is worth using such a table for one-to-many associations too, to avoid having a back reference field on the collection elements.
  • Inheritance is by far the most complex semantics to maintain, as it is not supported at all by relational databases: Single/Class/Concrete Table Inheritance are three famous patterns which organize hierarchical objects in tables, but I prefer to avoid inheritance altogether when not strictly necessary.
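For instance, this is a minimal sketch of the metadata a developer may provide, written as docblock annotations in the style of Doctrine 2 (still in development at the time of this writing); the class and its fields are hypothetical:
<?php
/**
 * @Entity
 */
class User
{
    /** @Id @Column(type="integer") @GeneratedValue */
    private $id;        // surrogate primary key, added ex novo

    /** @Column(type="string", length=255) */
    private $nickname;  // becomes a varchar(255) column on the User table

    /** @ManyToOne(targetEntity="City") */
    private $city;      // becomes a foreign key column referencing City

    /** @ManyToMany(targetEntity="Group") */
    private $groups;    // becomes a User_Group association table
}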
And this list is one of the reasons why in your college courses they told you that a model should be as small and simple as possible: a simple model undergoes much simpler transformations for storage and transmission than a complex one.
Note that some contamination leaks from the database side into the object graph, such as the bidirectionality of one-to-many relationships, present even when the domain model does not require it.
Orms take care of this translation process for you and can even generate the tables from the classes' source code, but they only automate the tedious part of object-relational mapping. You should know very well how the mapping works if you plan to use such powerful tools without reducing your database to a list of key/value pairs.

The image at the top is an Entity-Relationship model, used to design database schemas. I don't find it useful anymore, as I now prefer to think in terms of classes, with Uml diagrams.

Tuesday, December 08, 2009

5 reasons to be happy in a terminal

I am not talking about an airport terminal, but about one of the terminal emulators provided by modern window managers, like gnome-terminal for Gnome and the similar Konsole for Kde, along with the minimal xterm. These are all unix applications, but equivalent ones exist for other platforms like Windows, although their integration with the underlying operating system and with specific programs can be tricky.
Why should you, a software developer/engineer, want to spend most of your time in a dumb terminal instead of in a powerful and costly IDE? I have five reasons to convince you.
  • instant access to unix programs. GUIs facilitate the job of naive users, but the real power resides in the command line tools which perform the real work; moreover, cli programs can be chained in endless ways through their universal plain text interface.
  • the classic 80x25 terminal has short lines, and few of them: too much logic in a line stands out because the line wraps onto the next one. Overlong methods are spotted as well, because they don't fit in one or two screens and require scrolling.
  • transparent remoting with ssh. The same can be said of VNC, but it can be very slow and it's not always supported, while many servers run an ssh daemon. It is so fast that I did not notice the latency between my local machine and other boxes in my Lan, so I have given them differently colored prompts to easily distinguish between environments.
  • uninterrupted flow; not using the mouse makes you move very quickly in the cli environment, once you know what to write and how to leverage the text-based tools.
  • every executed command is registered for possible future repetition and modification. Try recording a procedure of 90 control panel clicks instead.
As a side note, thanks to history, it's also very simple to calculate statistics on the most popular commands you type on your development box:
[12:29:32][giorgio@Indy:~]$ history | awk '{print $2;}' | sort | uniq -c |
> sort -nr | head -n 10
   4924 vim
   1326 svn
    879 nakedphpunit    // it's an alias for phpunit --bootstrap=...
    616 sudo
    438 cd
    266 ls
    238 osstest_sqlite
    207 phing
    135 ./scripts/regenerate
    127 grep
Of course most of them were only typed the first time and then recalled. From these data you can infer that I use the command line interface a lot, and I've never been more productive. This statistic is a typical example of leveraging command line tools: a construct that took me less than a minute to write and that I can repeat whenever I want in a few seconds.

Sooner or later, the time comes when a developer feels constrained by his graphical interfaces and resorts to using the command line directly. If he avoids the command line, it's probably because he does not know how to work with it. Don't be as proud as that developer: take some time to learn, and the cli will repay you soon.

Monday, December 07, 2009

PHPUnit and Phing cohabitation

During the publication of Practical Php Testing, some readers asked me to include information on how to make PHPUnit and Phing work together. Due to time constraints it was not possible to include an appendix on this topic, so I will talk about it here.

First, some background:
  • PHPUnit is the leading testing harness in the php world: it consists of a small, powerful framework for defining test cases, making assertions and mocking classes.
  • Phing is an Ant clone written in php, which should become the standard solution for automating targets of php applications such as deployment, running different test suites at the same time and generating documentation. Why use Phing instead of Ant? Because it interfaces well with php applications.
Integrating these two tools means giving Phing access to a PHPUnit test suite and letting the Phing build files, which manage configuration, also contain information on how to run the test suite. In the build.xml file of an application you should find different targets like generate-documentation, test-all, compile-all (if php were a compiled language), and so on.

There are two ways of accessing PHPUnit test suites via Phing: the exec and phpunit tasks.
At the time of this writing, the phpunit task bundled in stable releases of Phing lacks functionality, primarily the ability to define a bootstrap file to execute before the test suite is run. I can't live without --bootstrap, and I look forward to a release of Phing that lets me specify this file in the configuration.
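For context, such a bootstrap file is plain php that prepares the environment before any test runs; here is a minimal sketch, with hypothetical paths and assuming a Zend Framework application:
<?php
// tests/bootstrap.php: executed once, before the whole suite runs
set_include_path(realpath(dirname(__FILE__) . '/../library')
                 . PATH_SEPARATOR . get_include_path());
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance();  // autoloads Zend_* classes during tests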
This release will be Phing 2.4.0 (at least a Release Candidate 3 version, while as of December 2009 it is at RC2). There are two things being fixed that would otherwise annoy the average developer a lot:
  • There was a bug in the last RC release affecting the bootstrap parameter: the inclusion took place too late in the process, producing fatal errors when the suite relies, for instance, on autoloading. This bug is fixed in the repositories and will be gone in the next RC release. I downloaded the simple patch and applied it manually to try out the bootstrap functionality, and it works very well. (http://phing.info/trac/ticket/378)
  • The summary formatter is not a summary: it uses the wrong hook method, producing a report for every single test case and resulting in an output hundreds of lines long. I opened a ticket to tackle this issue. (http://phing.info/trac/ticket/401)
What we will be able to do
<target name="test">
    <phpunit bootstrap="tests/bootstrap.php">
        <formatter type="summary" usefile="false" />
        <batchtest>
            <fileset dir="tests">
                <include name="**/*Test.php"/>
            </fileset>
        </batchtest>
    </phpunit> 
</target> 
When you push the big test button on your desktop (from the cli, type phing test), this xml configuration will hopefully produce a report while your test suite runs.
The problems with this approach are that it does not work yet, due to the bugs I have listed earlier, and that it eats quite a bit of memory, forcing me to raise the limit to 128 megabytes for a suite composed of 144 unit tests.

What we do now
Until a stable version of Phing 2.4 is released, we should rely on exec commands, which directly call the phpunit binary executable (not so binary: it is in fact a php script):
   <target name="test">
        <exec command="phpunit --bootstrap tests/bootstrap.php --configuration tests/phpunit.xml --colors"
              dir="${srcRoot}" passthru="true" />
        <exec command="phpunit --bootstrap=example/application/bootstrap.php --configuration example/application/tests/phpunit.xml --colors"
              dir="${srcRoot}" passthru="true" />
    </target>
${srcRoot} is a property that specifies the working directory to run the phpunit command in. passthru makes the task echo the output of the command.
This approach is sometimes more flexible than using the specialized phpunit task: you don't have to wait for Phing to include options for configuring new phpunit features in its tasks, because you can use them as soon as they are available from the command line. On the other hand, it may be difficult to perform different actions (like lighting up a red semaphore in your office) based on the last build state (red or green).

So I'm relying on exec tasks for now. By the way, the result is pretty and colors are even preserved, but I have to wait for the end of the exec command to see any output (no dots slowly piling up on the screen).
If you enjoy using Phing and PHPUnit, please provide feedback and contribute to the projects, especially in the case of Phing: it is a project that deserves more attention from the community for its integration tasks.
UPDATE: Phing 2.4.0 was released on January 17, 2010.

Friday, December 04, 2009

Evolution of inclusion

Once upon a time, there was the php include() construct. Reusing parts of pages and other code was as simple as:
<?php
include 'table.php';
Included files gain the same scope as the parent script, with no need to look for global variables somewhere else.
The problem came when there were parameters that influenced the final result, and the included html was a template expecting its variables to assume a value:
<?php
$user = new User();
$entries = array(...);
$showCaption = true; 
include 'table.php';
This programming style is also known as Accumulate and fire, and it exposes a rather poor Api. There is no way to learn which variables the template needs, nor to immediately signal errors in their population: if I comment out the $user assignment or set it to an array, the script will not notice until the variable is used deep inside table.php.

So the approach of php programmers evolved into writing functions and classes whose only responsibility is generating html. These classes are called View Helpers.
Since the generation process involves calling a method, the method's signature takes care of exposing the list of parameters and of validating them.
Some dependencies can be injected once via the constructor of a view helper (or via setters), but not every piece of code lends itself to being put in a view helper, because not every line of code is actually reused. Often this logic-poor php code is placed in View classes (usually view scripts in php). View scripts are very simple for designers to modify, although some people think an additional layer should absolutely be interposed between the php code and the front-end designers.
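This is a minimal sketch of what such a helper may look like; the class name and method signature are hypothetical, but the structure is similar to the helpers of the frameworks of the time:
<?php
class TableHelper
{
    /**
     * Renders an html table for the given entries: the signature exposes
     * and enforces what the include-based template could only assume.
     */
    public function table(array $entries, $showCaption = true)
    {
        if ($entries === array()) {
            throw new InvalidArgumentException('No entries to display.');
        }
        $html = $showCaption ? "<caption>Users</caption>\n" : '';
        foreach ($entries as $entry) {
            $html .= '<tr><td>' . htmlspecialchars($entry) . "</td></tr>\n";
        }
        return "<table>\n" . $html . "</table>\n";
    }
}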

View scripts can include a header.php or a footer.php to avoid duplication of common code, but this solution does not remove duplication: it only reduces it. Try to change the filename of header.php and you will see.
Thanks to url rewriting, the single point of entry pattern became trivial to implement, and now every serious framework has only one or two top-level php files, which are loaded by the browser with different parameters and determine which action to perform and which view script to show to the end user (the MVC paradigm in php applications).
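As an illustration, a single point of entry can be as small as this sketch (the action scripts and their location are hypothetical):
<?php
// index.php: every request is rewritten to this file, and a url
// parameter selects the action to execute
$action = isset($_GET['action']) ? $_GET['action'] : 'index';
$allowed = array('index', 'show', 'edit');  // whitelist to avoid arbitrary inclusion
if (!in_array($action, $allowed, true)) {
    $action = 'index';
}
require 'actions/' . $action . '.php';      // the action then chooses a view script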

Still, there was the problem of configuration: sometimes we need pages with the menu on the right and sometimes not (maybe a forum index is too large). In printable versions no navigation should be shown; in other pages some submenu can be open or closed depending on the context.
And so the famous Two Step View pattern was implemented, and its process now minimizes code duplication:
  • the action chosen by url parameters is executed, and its output consists of some variables which populate the view. There is still no Api to refer to, but if you keep view scripts small the problem does not arise often, and there is no mandatory scope mixing through direct include() usage.
  • the chosen first view is rendered, and its generated content is saved in a variable, usually via output buffering. In Zend Framework, Zend_View is the object that manages the rendering of a script and acts as a generic view class. Using view scripts instead of view classes eliminates the need for a templating language: imagine web designers modifying a php class.
  • then the chosen second view is rendered, passing the first result as a variable named by convention, for example $content. In Zend Framework, the second view is a script managed by Zend_Layout.
  • both view scripts, which we shall call the view and the layout from now on, have access to view helpers, objects injected in the way and form you prefer, so that the different kinds of tedious html generation can be kept in their own cohesive classes and tested independently. A sketch of the whole process follows this list.
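The mechanics can be condensed into a few lines; this is a minimal sketch with hypothetical file names, while frameworks wrap the same steps in objects like Zend_View and Zend_Layout:
<?php
function render($script, array $variables)
{
    extract($variables);   // expose the variables to the view script
    ob_start();            // capture the output of the include...
    include $script;
    return ob_get_clean(); // ...and return it instead of echoing it
}

// first step: render the view chosen by the action
$content = render('views/user-list.php', array('entries' => array('alice', 'bob')));
// second step: render the layout, passing the first result as $content
echo render('layouts/default.php', array('content' => $content));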
Sometimes a view script still include()s another... But if the code gets complex, the latter view script can usually be refactored and transformed into a view helper.

Thursday, December 03, 2009

Sequels and reboots

In The Mythical Man-Month, one of the most famous books about software engineering, Fred Brooks says, referring to software projects:
Chemical engineers learned long ago that a process that works in the laboratory cannot be implemented in a factory in one step. An intermediate step called the pilot plant is necessary [...] Hence, plan to throw one away; you will, anyhow.
I see this as the perfect reverse of the situation in movie production.
It is indeed true that in the motion picture industry sequels often ruin the feeling of the original movie, or at least do not come close to its perfection (though they can surpass it in success thanks to advertising campaigns and the public's expectations).
Consider these science-fiction movies as an example:
  • Star Wars: it is widely believed that no sequel or prequel can measure up to the original A New Hope. The finale, which I don't want to spoil here, is probably the most famous scene in the history of the genre.
  • The Terminator: although every sequel contains the catch-phrase Come with me if you want to live, the original 1984 movie is still the most revolutionary.
  • The Matrix: should I say anything?
In software projects, instead, the 1.x version is usually the first version to throw away before upgrading to one of the subsequent world-changing releases:
  • The Linux kernel: today at version 2.6, it finally supports the majority of devices and will never require you to insert a driver cd (in the worst case you have to compile drivers, which is really annoying).
  • OpenOffice.org, whose 3.x version is becoming more and more widespread and has recently reached 100 million downloads.
Open source projects are often reluctant to mark the 1.x release of a software before it is really complete and tested. Despite this humility, they often experience success in the releases that come later, maybe because it is the wide adoption of the first stable version that exposes architectural problems and other issues in a large installed base for the first time. The developers' effort is also minimized, since a 2.x version sees the light only if there is enough momentum and following from the first release.

So why not cite php's sequels, given that it has reached version 5.3?
Extending the cinematographic comparison, often we watch reboots instead of sequels. This kind of movie is very fashionable nowadays.
Rewrites is the right word, since a movie franchise reboot corresponds to a major code rewrite. A rewrite is different from an upgrade since not only does it break binary and Api compatibility, it also consists of throwing away entangled and coupled code to write a new solution from scratch. The border between the two is not strictly marked, and many major releases are in fact rewrites which maintain the same name for brand popularity reasons.
Such a rewrite differs from a movie reboot in that it maintains some continuity between the old and the new version, for example in the Ubiquitous Language, but it is much more dangerous and can lead to a never-ending development phase (does anyone remember the Mozilla suite's fate?).
Ports can also be considered reboots if started from empty source files, and often the original source code is unavailable. Today the most famous ports start from a commercial application to reimplement it as an open source one. Forks and ports of open source applications instead recycle code and remain connected with the original project.
Besides the issues in rewriting from scratch, there are successful attempts of reboots in the open source software ecosystem:
  • Apache 2 is a substantial rewrite according to Wikipedia; Apache's original duct-taped version gained its name from the phrase a patchy web server.
  • Php's rewrites before versions 3 and 4, the latter of which saw the introduction of the Zend Engine.
  • Grub, the bootloader for GNU/Linux machines, has recently been replaced by the completely rewritten Grub 2 in Ubuntu, but no one I have seen using Karmic Koala has noticed the difference.
Are you writing a reboot, a sequel or an original movie? :)

Wednesday, December 02, 2009

Practical Php Testing is here


Practical Php Testing, my ebook on testing php applications, is finally here as promised, in the first days of December.
How many times in the last month have you seen a broken screen in the browser? How many times did you have to debug in the browser, by looking at the output, inserting debug statements and breaking redirects? How many times did you perform manual testing, loading a staging version of your application and trying out different workflows in the browser?
If the answer to these questions is more than very few, it's likely that you should give automated testing a chance.
This book is aimed at php developers. Half of it features the articles from the Practical php testing series, while the other half is composed of new content:
  • a bonus chapter on TDD theory;
  • a case study on testing a php function;
  • working code samples, some of which were originally kept on pastebin.com;
  • sets of TDD exercises at the end of each chapter;
  • a glossary that substitutes for external links to wikis and other posts, so that your reading is not interrupted by term lookups.
The book comes for free and is licensed under Creative Commons, which means you are free to copy it and give it to anyone. If you find my work useful and want to be supportive, you can make a donation with the link in the right menu or with the one provided in the book.

Tuesday, December 01, 2009

Practical Php Testing errata

This is the errata page of the Practical Php Testing ebook, where typos and other errors will be listed and corrected. This page is linked in the Errata section of the book, to provide updates if errors are found, without releasing different and confusing versions of the book.

Practical Php Testing will be published on December 2, 2009.
