Sunday, October 11, 2009

The value of abstractions

There is a constant trade-off between abstraction and performance, in software development as in other fields. We write code in high-level languages rather than in assembly or machine code, but sometimes drop down a level to write fast C extensions.
Every time you call a function, be it your own work or provided by a language or a framework, you are using an abstraction: is it worthwhile, or are you wasting time writing bloated levels of indirection?

To understand the importance of abstractions in modern software, consider the process of retrieving a web page. When you type a URL in the location bar of your browser, you take advantage of the following abstractions:
  • first of all, the location bar, together with forms and inputs, is an abstraction over the raw Hypertext Transfer Protocol (HTTP). You don't write HTTP requests such as GET / by hand.
  • the HTTP request and response are text messages carried over a byte-stream protocol at a lower level of abstraction, the Transmission Control Protocol (TCP). The purpose of this abstraction is to free the higher-level protocols from the burden of segmenting messages and of ensuring their reliable, ordered delivery.
  • the TCP layer then uses the underlying one, the Internet Protocol (IP), to deliver packets back and forth between your machine and the web server. IP itself makes no delivery guarantees; it abstracts away physical details by identifying internet hosts with numeric addresses, the famous IP addresses everyone is always talking about.
  • the IP layer uses one or more data link layer protocols, such as Ethernet, to send frames of data between physical network points. In an Ethernet network these points are identified by MAC addresses.
  • while Ethernet is capable of physically transporting data by changing the voltage over wires, it treats electronic circuits, such as CMOS logic gates, as black boxes that perform logic and arithmetic operations. These circuits present only binary voltage levels as their output and hide all the transistors they are built from.
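To make the first two layers of the list concrete, here is a minimal Python sketch of what the location bar spares you from typing. It builds a raw HTTP GET request by hand and hands it to a TCP socket. The function names build_get_request and fetch are hypothetical, chosen only for this illustration, and the request carries just the headers a basic server would likely need.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build the raw text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close when it is done
        "",                   # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, path: str = "/", port: int = 80) -> bytes:
    """Send the request over a TCP socket and return the raw response bytes.

    TCP delivers the bytes in order; segmentation and retransmission
    never show up in this code.
    """
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch("example.com") would need network access and returns the status line, headers and body as one block of bytes, which is exactly the detail a browser hides from you.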
That's a total of five levels of abstraction just to retrieve a web page. But there's more: every time you write a function or a class you are producing an abstraction. You hide the implementation details as private members of a class and present a public API, which constitutes the abstraction. Moreover, if you design an object model, or a relational model, or any kind of model, you're abstracting away details from the real world. When writing the classic User class for a blogging application, you probably do not include the height or the eye color of the user as a field.
Even assembler has procedures: multiple levels of abstraction are present in every piece of software we encounter.
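As an illustration of the User example, here is a minimal Python sketch. The fields and the check_password method are assumptions made up for this example, not taken from any real blogging application. The salt and the hash are implementation details kept behind a small public API, and real-world attributes that the application does not need are simply left out of the model.

```python
import hashlib
import hmac
import os


class User:
    """A blog user, modelled only as far as the application needs."""

    def __init__(self, username: str, password: str) -> None:
        self.username = username              # public part of the model
        # Leading underscores mark these as private by Python convention.
        self._salt = os.urandom(16)
        self._password_hash = self._hash(password)

    def _hash(self, password: str) -> bytes:
        # Implementation detail: callers never see how hashing is done.
        return hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), self._salt, 100_000
        )

    def check_password(self, candidate: str) -> bool:
        """Public API: report whether the candidate password matches."""
        return hmac.compare_digest(self._password_hash, self._hash(candidate))
```

Code that uses User only ever calls check_password, so the hashing scheme can change later without touching the callers.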

We have said that there is a trade-off in using abstractions: there are many advantages in dealing with less data and with a simplified model of reality, built to suit the user of the abstraction. This user can be an upper layer of abstraction or a real person. These advantages, however, come at a cost:
  • Performance cost. Every software layer has an overhead, which is time spent performing management operations and not business (read: useful for the job at hand) ones. If you're calling an external method or function, you're pushing arguments on the stack and passing control to a subroutine; the CPU must also take the time to allocate all local variables. More abstract code also tends to take longer execution paths, because it has to handle every case it may be asked to manage. Multiply these time costs by thousands of calls and you get the picture.
  • Limitations of the interface. When you want to do something the abstraction does not provide as a feature, you have to go down to a lower level, and that is often an unpleasant activity.
  • Leaks. Sometimes an abstraction performs horribly if you ignore what it hides from your view. For instance, Java's Remote Method Invocation feature lets you call methods on objects that live on other physical machines, treating them as local instances. Imagine what happens when the connection is slow or too many calls are made...
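The leak described in the last point can be sketched without any real network. The Python example below is a simplified stand-in written for this post, not Java RMI itself: FakeRemoteInventory is a hypothetical class that counts how many calls would have crossed the wire. Code that treats the object as local pays one round trip per item, while code that knows about the hidden network asks for everything at once.

```python
class FakeRemoteInventory:
    """Stands in for an object that lives on another machine."""

    def __init__(self, prices):
        self._prices = dict(prices)
        self.round_trips = 0  # every method call would cross the network

    def price_of(self, item):
        self.round_trips += 1
        return self._prices[item]

    def prices_of(self, items):
        # One call, one round trip, however many items are requested.
        self.round_trips += 1
        return [self._prices[item] for item in items]


def total_chatty(inventory, items):
    """Treats the remote object as local: one round trip per item."""
    return sum(inventory.price_of(item) for item in items)


def total_batched(inventory, items):
    """Acknowledges the hidden network: a single round trip."""
    return sum(inventory.prices_of(items))
```

Both functions return the same total, but for three items the first costs three round trips and the second only one. On a slow connection that difference is the leak.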
The pros of abstraction, however, usually far outweigh the cons:
  • Simplicity of the interface when there is no need to consider the underlying layer; think about the location bar of your browser again. This is a fundamental principle of software development, Divide et impera.
  • Standardization and reuse: the TCP layer abstraction is used by all the upper level protocols, such as FTP, SMTP, etc.; they do not have to implement their own low-level procedures.
  • Decoupling: the protocols of the upper layers and the high-level software components (if correctly designed with Dependency Injection) are decoupled from the low-level implementations. This is true for IP, which spares browsers from knowing which voltage levels to transmit, and for Zend_Auth PHP objects, which do not know whether they are authenticating credentials against LDAP, a database table, or whatever.
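The decoupling point can be sketched in a few lines. The Python example below is only loosely inspired by the adapter idea mentioned for Zend_Auth; it is not the Zend Framework API, and the names AuthAdapter, InMemoryAdapter, AlwaysDenyAdapter and Auth are hypothetical. The high-level Auth object receives its adapter through the constructor and never learns where the credentials are stored.

```python
from typing import Dict, Protocol


class AuthAdapter(Protocol):
    """What the high-level code expects from any credential store."""

    def authenticate(self, username: str, password: str) -> bool: ...


class InMemoryAdapter:
    """Checks credentials against a plain dictionary (illustration only:
    a real store would never keep passwords in clear text)."""

    def __init__(self, credentials: Dict[str, str]) -> None:
        self._credentials = credentials

    def authenticate(self, username: str, password: str) -> bool:
        return self._credentials.get(username) == password


class AlwaysDenyAdapter:
    """A second implementation, e.g. for a site in maintenance mode."""

    def authenticate(self, username: str, password: str) -> bool:
        return False


class Auth:
    """High-level component: depends only on the adapter interface."""

    def __init__(self, adapter: AuthAdapter) -> None:
        self._adapter = adapter  # injected, not constructed here

    def login(self, username: str, password: str) -> bool:
        return self._adapter.authenticate(username, password)
```

Swapping InMemoryAdapter for an adapter backed by LDAP or a database table would not require any change to Auth, which is the kind of decoupling described above.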
The trade-off is present in nearly every component of a software application: my previous post on PHP profiling started a discussion about whether it's right to use a framework and an ORM as abstractions over the PHP core capabilities, or to cut out the middle man and just use native PHP directly. In general, I am a fan of the abstraction choice, since the advantages it provides are concrete and immediate, while the disadvantages are only hypothetical. Maybe the performance will need optimization and caching; maybe the abstraction will leak details; but surely the developers' work will be simpler than with native PHP alone, and the flexibility of the produced code will be increased.
There is also the fear of overengineering an application. But don't let fear take control of your development process: make informed choices about using frameworks and external tools. Experiment and understand the strong and weak points of different approaches before adopting or removing a level of abstraction.

One famous quote summarizes this post:
There is no problem in computer science which cannot be solved by one more level of indirection, except too many levels of indirection.


Rob Hofmeyr said...

Nice post.

I agree completely. Unfortunately there are many cases where developers abstract (wrap) existing functions. I don't see why creating a Zend_Session object is any easier or different to using the global $_SESSION variable. There are numerous other examples present in the Zend Framework: $this->escape instead of htmlentities, Zend Validate instead of php/pecl filter etc.

Also, one needs to be especially careful with high levels of abstraction in interpreted languages like PHP. High-level abstraction in compiled languages will have a negligible effect on performance once compiled, whereas the impact is far more severe when the script is being interpreted.

Writing my rebuttal to "5 reasons to use a framework" this evening :). Will let you know when it's posted.

Giorgio said...

Sure Rob, post the link in the comments and I'll read it. It's always nice to see different points of view.
For the $_SESSION case, and the other native functions, the problem is for example that they are not object-oriented and so they cannot be mocked out in testing. If I pass a Zend_Session_Namespace instance to a class of mine, I can mock it and run thousands of tests in the same process. If I use $_SESSION directly, I run into trouble as soon as I run more than one test, and I litter my classes with global state. This is valid for every kind of native function or feature, and wrapping it is considered good practice.
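The testing argument can be sketched in Python rather than PHP. VisitCounter is a hypothetical class invented for this illustration, and a plain dict plays the role of the mocked session object. Because the session is passed in instead of being read from a global, each test gets its own fresh state.

```python
class VisitCounter:
    """Counts page visits using whatever session object it is given."""

    def __init__(self, session) -> None:
        self._session = session  # injected: any dict-like object will do

    def record_visit(self) -> int:
        count = self._session.get("visits", 0) + 1
        self._session["visits"] = count
        return count


def test_counts_from_one():
    # A fresh fake session per test: nothing leaks between tests.
    assert VisitCounter({}).record_visit() == 1


def test_resumes_existing_session():
    assert VisitCounter({"visits": 41}).record_visit() == 42
```

Had VisitCounter read a module-level global instead, the second test would depend on whatever the first one left behind.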