Thursday, October 29, 2009

The Repository pattern

A Repository is a Domain-Driven Design concept, but it is also a standalone pattern. Repositories are an important part of the domain layer and encapsulate the persistence and reconstitution of objects.

A common infrastructure problem in an object-oriented application is how to deal with the persistence and retrieval of objects. The most common way to obtain references to entities (for example, instances of the User, Group and Post classes of a domain) is through navigation from another entity. However, there must be a root object whose methods we call to obtain the entities to work with. Accessing the database tables directly from domain code couples that code to the storage schema and makes it hard to test in isolation.
A Repository implementation provides the illusion of having an in-memory collection available, where all the objects of a certain class are kept. Actually, the Repository implementation has dependencies on a database or filesystem where thousands of entities are saved as it would be impractical to maintain all the data in the main memory of the machine which runs the application.
These dependencies are commonly wired and injected during the construction process by a factory or a dependency injection framework, and domain classes are only aware of the interface of the Repository. This interface logically resides in the domain layer and has no external dependencies.

Let's look at an example. Suppose your User class needs a choicesGroups() method that lists the groups an instance of User can subscribe to. Business rules require building the criteria from internal data of the User class, such as the role field, which can take the values 'guest', 'normal' or 'admin'.
This method should reside on the User class. However, to preserve a decoupled and testable design, we cannot access the database directly from the User class to retrieve the Group objects we need: otherwise we would have a dependency, even if injected, from the User class, which is part of the domain layer, to an infrastructure class. This is what we want to avoid, as we want to run thousands of fast unit tests on our domain classes without having to start a database daemon.
So the first step is to encapsulate the access to the database. This job is already done in part by a DataMapper implementation: [N]Hibernate, Doctrine2, Zend_Entity. But using a DataMapper directly in the User class would let its code do a million different things and call every method on the DataMapper facade. We want to segregate only the necessary operations in the contract between the User class and the bridge to the infrastructure, as methods of a Repository interface. A small interface is simple to mock out in unit tests, while a DataMapper implementation can usually only be substituted by an instance of itself that uses a lighter database management system, such as sqlite.
So we prepare a GroupRepository interface:
<?php
interface GroupRepository
{
    public function find($id);

    /**
     * @param array $values   pairs of field name => value
     * @return array (unfortunately PHP has no native Collection type)
     */
    public function findByCriteria($values);
}
Note that a Repository interface is pure domain code, and we can't rely on a library or framework for it. It lives in the same domain layer as our User and Group classes.
The User class now has dependencies only towards the domain layer. It is rich in functionality rather than anemic: we do not have to break the encapsulation of the role field, and the method is cohesive with the class:
class User
{
    private $_role;

    // ... other methods

    public function choicesGroups(GroupRepository $repo)
    {
        // criteria are pairs of field name => value, as the interface docblock states
        return $repo->findByCriteria(array('roleAllowed' => $this->_role));
    }
}
Now we should write an actual implementation of GroupRepository, which will bridge the domain with the production database.
class GroupRepositoryDb implements GroupRepository
{
    private $_mapper;

    /**
     * The mapper could equally be a Doctrine\ORM\EntityManager
     * (with the type hint changed accordingly)
     */
    public function __construct(Zend_Entity_Manager_Interface $mapper)
    {
        // save the collaborator in a private field
        $this->_mapper = $mapper;
    }

    // methods implementations...
}
In testing, we pass to the choicesGroups() method a mock or a fake implementation which really is an in-memory collection, and which can also have utility methods for setting up fixtures. Unit testing involves only a few objects, and keeping them all in memory is the simplest solution. Moreover, we have inverted the dependency, as now both GroupRepositoryDb and User depend on an abstraction instead of on an implementation.
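As a minimal sketch of such a fake, assuming a Group entity with getId(), getName() and getRoleAllowed() getters (the article does not show Group, so these are hypothetical), an in-memory repository could look like the following. InMemoryGroupRepository and its add() fixture method are illustrative names, not part of any library.

```php
<?php
// Contract from the article, repeated so the sketch is self-contained.
interface GroupRepository
{
    public function find($id);
    public function findByCriteria($values);
}

// Hypothetical minimal Group entity.
class Group
{
    private $_id;
    private $_name;
    private $_roleAllowed;

    public function __construct($id, $name, $roleAllowed)
    {
        $this->_id = $id;
        $this->_name = $name;
        $this->_roleAllowed = $roleAllowed;
    }

    public function getId() { return $this->_id; }
    public function getName() { return $this->_name; }
    public function getRoleAllowed() { return $this->_roleAllowed; }
}

// Fake used only in unit tests: a plain array stands in for the database.
class InMemoryGroupRepository implements GroupRepository
{
    private $_groups = array();

    // Fixture helper; not part of the GroupRepository contract.
    public function add(Group $group)
    {
        $this->_groups[$group->getId()] = $group;
    }

    public function find($id)
    {
        return isset($this->_groups[$id]) ? $this->_groups[$id] : null;
    }

    public function findByCriteria($values)
    {
        $found = array();
        foreach ($this->_groups as $group) {
            if ($this->_matches($group, $values)) {
                $found[] = $group;
            }
        }
        return $found;
    }

    // Assumption: every criteria field maps to a getter named get<Field>().
    private function _matches(Group $group, $values)
    {
        foreach ($values as $field => $value) {
            $getter = 'get' . ucfirst($field);
            if (!method_exists($group, $getter) || $group->$getter() !== $value) {
                return false;
            }
        }
        return true;
    }
}

class User
{
    private $_role;

    public function __construct($role) { $this->_role = $role; }

    public function choicesGroups(GroupRepository $repo)
    {
        return $repo->findByCriteria(array('roleAllowed' => $this->_role));
    }
}

// Usage in a test: no database is started at any point.
$repo = new InMemoryGroupRepository();
$repo->add(new Group(1, 'Moderators', 'admin'));
$repo->add(new Group(2, 'Newcomers', 'guest'));
$repo->add(new Group(3, 'Staff', 'admin'));

$user = new User('admin');
$choices = $user->choicesGroups($repo);
```

The User class cannot tell this fake from the production implementation, since it only sees the GroupRepository interface.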

Another advantage of the Repository and its segregated interface is the reuse of queries. The User class has no knowledge of any SQL/DQL/HQL/JPQL query language, which would tie it to the internal structure of the Group class and its relations. What the contract (the GroupRepository interface) establishes are only logical operations which take domain data as parameters.
For instance, if Group changes to incorporate a one-to-many relation to the Image class, the loading code is not scattered throughout the classes which refer to Group, like User, but is centralized in the Repository. If you do not incorporate a generic DataMapper, the Repository becomes a Data Access Object; the benefit of building a Repository implementation on top of a DataMapper is that you can unit test it against a lightweight database. What you are testing are queries and results, the only responsibility of a Repository.
Note that in this approach the only tests that need a database are the ones which instantiate a real Repository and are focused on it, not your entire suite. That's why the DDD approach is best suited to complex applications which require a large amount of automated testing. Fake repositories should be used in all other unit tests to prevent them from becoming integration tests.
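A test focused on a real Repository could be sketched as below. This is only an illustration under several assumptions: it talks to PDO directly instead of going through a generic DataMapper (so, in the terms used above, it is closer to a Data Access Object), the PdoGroupRepository class, the user_group table and its columns are all hypothetical, and the pdo_sqlite extension is assumed to be available.

```php
<?php
interface GroupRepository
{
    public function find($id);
    public function findByCriteria($values);
}

// Hypothetical minimal Group entity.
class Group
{
    private $_id;
    private $_name;

    public function __construct($id, $name)
    {
        $this->_id = $id;
        $this->_name = $name;
    }

    public function getId() { return $this->_id; }
    public function getName() { return $this->_name; }
}

// Hypothetical implementation; all SQL stays inside this class.
class PdoGroupRepository implements GroupRepository
{
    private $_pdo;
    // Whitelist mapping criteria fields to columns, so field names never reach raw SQL.
    private $_columns = array('name' => 'name', 'roleAllowed' => 'role_allowed');

    public function __construct(PDO $pdo) { $this->_pdo = $pdo; }

    public function find($id)
    {
        $stmt = $this->_pdo->prepare('SELECT id, name FROM user_group WHERE id = ?');
        $stmt->execute(array($id));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        return $row ? new Group((int) $row['id'], $row['name']) : null;
    }

    public function findByCriteria($values)
    {
        $conditions = array();
        $params = array();
        foreach ($values as $field => $value) {
            if (!isset($this->_columns[$field])) {
                throw new InvalidArgumentException('Unknown criteria field: ' . $field);
            }
            $conditions[] = $this->_columns[$field] . ' = ?';
            $params[] = $value;
        }
        $sql = 'SELECT id, name FROM user_group';
        if ($conditions) {
            $sql .= ' WHERE ' . implode(' AND ', $conditions);
        }
        $stmt = $this->_pdo->prepare($sql . ' ORDER BY id');
        $stmt->execute($params);
        $groups = array();
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $groups[] = new Group((int) $row['id'], $row['name']);
        }
        return $groups;
    }
}

// Test setup: an in-memory sqlite database that disappears with the connection.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE user_group (id INTEGER PRIMARY KEY, name TEXT, role_allowed TEXT)');
$pdo->exec("INSERT INTO user_group (id, name, role_allowed) VALUES (1, 'Moderators', 'admin')");
$pdo->exec("INSERT INTO user_group (id, name, role_allowed) VALUES (2, 'Newcomers', 'guest')");

$repo = new PdoGroupRepository($pdo);
$admins = $repo->findByCriteria(array('roleAllowed' => 'admin'));
```

Only tests of this kind need a database connection; everything that merely uses a GroupRepository can keep working with an in-memory fake.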

Even the old PHP applications which once upon a time only created and modified a list of entities are growing in size and complexity, incorporating validation and business rules. Generic DataMappers a la Hibernate are becoming a reality in the PHP world too, and they can spare the average developer from writing boring data access classes.
Their power, though, should be properly decoupled from your domain classes if your core domain is complex and you want to isolate it. Repositories are a way to accomplish this decoupling.

29 comments:

Matthew Weier O'Phinney said...

Small correction: your GroupContainerDb class should implement GroupContainer.

Giorgio said...

Thanks! Now it's fixed.

Anonymous said...

Nice article, but I think that writing DAO classes is more suitable, especially in PHP, where DAO libraries (the Zend_Db family) are more usable than Java's JDBC.

I have heard a lot from Java people: they say that using an ORM (Hibernate) in more complicated situations can produce very unoptimized SQL queries.

Giorgio said...

What you gain is freedom from the database.. :)

Anonymous said...

No, freedom from the database doesn't matter: the majority of companies use one type of database for their business web applications for many years, mostly forever.

An optimized data access layer for our application matters more than freedom from the db.

Giorgio said...

It depends on your requirements, but changing the dbms vendor does not interest me either. What I care about is being able to test all my domain classes in a few seconds using only ram, without setting up a database.

Fedyashev Nikita said...

Thanks for this post!

It really gave me some inspiration to look more closely at this pattern, because it solves many problems that I face now with the Active Record pattern and heavy database interaction in unit tests.

Can you please give an example of what this mock in-memory database can look like in unit tests?

Giorgio said...

My practice is to use an instance of the Data Mapper which relies on an sqlite in-memory database instead of the production one (usually mysql), and that is thrown away afterwards. In this case it will be an instance of Entity Manager.
The instantiation depends on the actual ORM that you are using, but you should only need to specify a different DSN.

Unknown said...

Somehow it doesn't feel right that the GroupRepository needs to be injected in choicesGroups(). Isn't it better for the GroupRepository to be injected in the user object during construction?

Giorgio said...

The important part is not to create it in the User code but having it injected, via constructor, setters or as a method parameter. Normally I would inject it, but I prefer not to keep field references to services in Entity classes (User, Group, Post) to simplify their management (creation, reconstitution from database via Orms, serialization...)

Unknown said...

But what if you pass this User object to a presentation layer and need to display the choicesGroups() result there? Then your presentation layer needs to know about a repository. I like the repository and think I will personally inject it during object construction.

Giorgio said...

It largely depends on the use cases (what if I'm passing it to a cache?)

lakiboy said...

Nice post. Actually all series is wonderful. Keep it going.

I must agree with the previous commenter regarding the client being aware of the repository. This brought me to the following question. Consider the example:

class User_Model_UserMapper extends SomeAbstractMapper
    implements User_Model_UserGateway
{
    ...
    /**
     * Implementation of User_Model_UserGateway::fetchByEmail()
     */
    public function fetchByEmail($email)
    {
        $select = $this->getDbTable()->select()->where('email = ?', $email);
        $result = $this->getDbTable()->fetchRow($select);
        if ($result instanceof Zend_Db_Table_Row) {
            // wrap the row in a domain object, passing the mapper as its gateway
            $result = new User_Model_User($result, $this);
        }
        return $result;
    }
    ...
}

It is easy to pass a mapper instance to the user object during its instantiation. In fact User_Model_UserMapper ::fetchByEmail() is an implementation of relevant interface method. I can also set/change the gateway later if needed with User_Model_User::setGateway(User_Model_UserGateway $gateway). But doing it later means that client code needs to be aware of it.

Hence my questions:
- how does a repository (gateway in my example) differ from a data mapper here?
- is there any flaw in the above example? I mean, is it ok for a mapper to implement the repository's interface? maybe some other design issue?

Thanks.

Giorgio said...

The standard relationship between DataMapper and Repository|Gateway is that a typical implementation of Repository has-a DataMapper as a private member and implements the Repository interface (which Clients depend upon). The Repository limits the operations available to Clients to its own methods, while a DataMapper can be highly generic with no segregation, since it is used as a library (e.g. Doctrine 2|Hibernate and clones).

lakiboy said...

Thanks for your answer. Actually googled your blog already and found all the answers I was looking for.

Anonymous said...

Question: can domain objects such as user also come from a repository?

Anonymous said...

Who is responsible for creating domain objects, especially when the data of the domain object comes from a database (e.g. user)? Should there be a Factory, or should the Repository also handle it?

Anonymous said...

Can you show a sample of saving a domain object? how it is implemented in repository?

Giorgio said...

Yes, the typical User, when coming from the database, is *reconstituted* by the Repository, which in turn uses a Data Mapper. For an example of a generic Data Mapper look at Doctrine 2; the implementation is not trivial.

Jeboy said...

Repository in DDD also serves as collection of Aggregate Roots. My question is, in retrieving a Domain Object which has child entities (Aggregate) via Repository how are those child entities of the root be retrieved as part of the parent Domain Object?

Thanks

Giorgio said...

Usually a Data Mapper fills in any object referenced by the Aggregate Root, whether they are children that must stay inside the Aggregate or other Entities. The difference is in the domain code: children which are internals of an Aggregate Root will not be exposed by getters (the Data Mapper accesses everything via reflection).

Jeboy said...

@giorgio, do you have sample code to illustrate your idea of constituting an aggregate from repository? thanks

Giorgio said...

*Reconstituting* an object is the responsibility of the Data Mapper, so you can take a look at Doctrine 2 tests:
http://github.com/doctrine/doctrine2/tree/master/tests/Doctrine/Tests/
The Repository provides a domain-specific Api on top of the Data Mapper's one. Since a Repository is specific to an Aggregate Root, the object composed by its entities won't be accessible via getters like the ones in Doctrine/Tests/Models.

Anonymous said...

Can Zend_Db_Table be used instead of Doctrine as mapper?

Giorgio said...

It depends on what you mean: if you compose a Zend_Db_Table object and use native Plain Old Php Objects, it will be no different from using Doctrine. If you use Zend_Db_Table_Row instances as domain objects, it won't be good...

dirsen said...

Hi Giorgio,

nice post,

I'm also implementing repository pattern using Doctrine 2.

Something is bugging me: should the repository class have a flush() method or not?

For example, I have a save() method in my repository class.
Should I call flush in that method or call it from client code?

class xxxRepository
{
    public function save($entity)
    {
        $this->em->persist($entity);
        $this->em->flush();
    }
}

in client code:
$xxxRepository->save($entity);

OR

pulling out flush() method from save() method and make it as new method in repository, so the client code will look like this:

$xxxRepository->save($entity);
$xxxRepository->flush();


The difference comes when I need to save multiple entities:
the first alternative means multiple calls to the database (flush() is called multiple times), while in the latter it is called only once.

It is also related to the db transaction boundary: who should take responsibility for completing the db transaction, the repository class or the client code?

what do you think about this ?

thanks

Giorgio said...

In other languages, implementations usually leave transactional mechanisms like flush() and similar methods out of the Repository. You can actually call it only once, at the end of the PHP script/action, and the result is that all the database interaction will happen at that moment, instead of having multiple queries performed at different times. An ORM like Doctrine 2 should be robust enough to allow this; if you encounter problems with some use cases (like exceptions thrown on flush), open a ticket on the Doctrine 2 bug tracker.

dirsen said...

@Giorgio,
I got it now, thanks for your answer
