Invisible to the eye: 2015

Thursday, June 18, 2015

Property-based testing primer

I'm a great advocate of automated testing and of finding out your code does not work on your machine, 30 seconds after having written it, instead of in production after it has caused a monetary loss and some repair work to be performed. This is true for many different kind of testing, from the unit level (which has also benefits for internal quality and design feedback) to the acceptance level (which ensures the stakeholders get what they need and documents it for the future). Your System Under Test can be a single class, a project or even a collaboration of [micro]services accessed through HTTP from another process or machine.

However, classic test suites written with xUnit and BDD styles have some scaling problems they hit when you want to exercise more than some happy paths:

it is difficult to cover many different inputs by hand-writing test cases, so we stick with at most a dozen of cases for a particular method.
There are maintenance costs for every new input we want to test: each need some assertions to be written and be updated in the future if the System Under Test changes its API.
We tend not to test external dependencies such as the language or libraries, since we trust them to do a good job even if their failure is our responsibility. We chose them and we are deploying our project, not the original authors which provided the code "as is", without warranty.

Note that here I define input as the existing state of a system, plus the new input provided by a test (the Given and When part of a scenario) and output as not only the actual response produced by the System Under Test but also the new state it has assumed.

sort()

Let's take as an example a sort() function, which no one implements today except in job interviews and exercises.
Assuming an array (or list, depending on your language), we can produce several inputs for the function like we would do in a kata:

[1, 2, 3]
[3, 1]
[3, 6, 5, 1, 4]

and so on. When do we stop? Maybe we also need some tricky input:

[]
[1]
[1, 1]
[2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9]

Once we have all the inputs gathered, we need to define what we expect for each of them:

[1, 2, 3] => [1, 2, 3]
[3, 1] => [1, 3[
[3, 6, 5, 1, 4] => [1, 3, 4, 5, 6]
[] => []
[1] => [1]
[1, 1] => [1, 1]
[2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9] => [1, 2, 3, 3, 5, 6, 6, 7, 8, 8, 9, 9]

You can do this incrementally, growing the code one new test case at a time, but you have to do it anyway. Considering this can become boring and error-prone in this toy example makes you wonder what to do when instead of sort() you have a RangeOfMoney class which is key to your business and manages point-like and arbitrary intervals of monetary amounts (true story).

Property-based testing in a nutshell

Property-based testing in an approach to testing coming from the functional programming world. To solve the aforementioned problems (and get new, more interesting ones), it follows these steps:

generate a random sample of possible inputs.
Exercise the SUT with each of them.
Verify properties which should be true on every output instead of making precise comparisons.
(Optionally) if the properties verification failed, possibly shrink to find a minimal input that still causes a failure.

How does this work for the sort() function?

We can use rand() to generate an input array:
This array is composed by natural numbers (Gen\nat) and it is long up to 100 elements (Gen\pos(100)), since very long arrays could make our tests slow.

Then, for each of these inputs, we exercise sort() and verify a simple property on the output, which is the order of the elements:
This is not the only property that sort() maintains, but it's the first I would specify. There are possible others:

every element in the input is also in the output
every element in the output is also in the input
the length of the input and output arrays are the same.

Here is the complete example written using Eris, our extension for PHPUnit that automates the generation of many kinds of random data and their verification.

How to find properties?

How do we apply property-based testing to code we actually write every day? It certainly fits more in some areas of the code than in others, such as Domain Model classes.

Some rules of thumb for defining properties are:

look for inverse functions (e.g. addition and substraction, or doubling an image in size and shrinking it to 50%). You can use the inverse on the output and verify equality with the input.
Relate input and output on some property that is true or false on both (e.g. in the sort() example than an element that is in one of the two arrays is also in the other)
Define post conditions and invariants that always hold in a particular situation (e.g. in the sort() example that the output is sorted, but in general you can restrict the possible output values of a function very much saying it is an array, it contain only integers, its length is equal to the input's length.)

[2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9] makes my test fail

Defining valid range of inputs with generators and the properties to be satisfied is a rich description of the behavior of the System Under Test. Therefore, when a sort() implementation fails we can work on the input in order to shrink it: trying to reduce its complexity and size in order to provide a minimal failing test case.

It's the same work we do when opening a bug report for someone else's code: we try to find a minimal combination that triggers the bug in order to throw away all unnecessary details that would slow down fixing it.

So in property-based testing the [2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9] can probably be shrinked to [2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8] and maybe up to [1, 0], depending on the bug. This process is accomplished by trying to shrink all the random values generated, which in our case were the length of the array and the values contained.

Testing the language

So here's some code I expect to work:
This function creates a PHP DateTime instance using the native datetime extension, which is a standard for the PHP world. It starts from an year and a day number ranging from 0 to 364 (or 365) and it build a DateTime pointing to the midnight of that particular day.

Here is a property-based test for this function:
We generate two random integers in the [0. 364] range, and test that the difference in seconds of the two generated DateTime objects is equal to 86400 seconds multiplied by the number of the days passed between the two selected dates. A property of the input (distance) is maintained over the output in a different form (seconds instead of days).

Surprisingly, this test fails with the following message: what happened is we triggered a bug of the DateTime object while creating it with a particular combination of format and timezone. The net effect of this bug could have been that our financial reports (telling daily revenue) would have started showing the wrong numbers starting from February 29th of the next year.

Notice that the input is shrinked to the simplest possible values that trigger the problem: January 1st on one value and March 1st on the other.
Eventually we found a easy work around, as with a couple more lines of code we can avoid this behavior. We could do that only after discovering the bug of course.

In conclusion

Testing an application is a necessary burden for catching defects early and fix them with an acceptable cost instead of letting them run wild on real users. Property-based testing pushes automation also in the generation of inputs for the System Under Test and in the verification of results, hoping to lower the maintanance cost while increasing coverage at the same time.

Given the domain complexity handled by the datetime extension, it's doing a fantastic job and it's being developed by very competent programmers. Nevertheless, if they can slip in bugs I trust that my own code will, too. Property-based testing is an additional tool that can work side by side with example-based testing to uncover problems in our projects.

We named the property-based PHPUnit extension after Eris, the Greek goddess of chaos, since serious testing means attacking your code and the platform it is built on in the attempt of breaking it before someone else does.

References

Eris on Packagist
Eris examples on GitHub
The tools that inspired Eris from other languages: QuickCheck, ScalaCheck, Erlang QuickCheck.

Wednesday, May 27, 2015

Eris 0.4.0 is out

Eris is a PHPUnit extension for property-based testing, that is to say testing based on generating inputs to a system and check its state and output respect a set of properties. The project is a porting of QuickCheck and Eris is the name of the Greek goddess of chaos, since its aim is to break the System Under Test with unexpected inputs and actions.

I am planning a talk at the PHP User Group Milano and a longer blog post to introduce the general public to how property-based testing works. I held the same talk for a few friends at the phpDay 2015 Unconference.

Meanwhile version 0.4.0 is out, with the following ChangeLog:

Showing generated input with ERIS_ORIGINAL_INPUT=1.
names and date (DateTime) new Generator.
tuple Generator supports variadic arguments.
Shrinking respects when() clauses.
Dates and sorting examples.

As for all semantic-versioned projects, the 0.x series should be considered alpha and no API backward compatibility is guaranteed on update.

Image credits

Sunday, March 29, 2015

Evolution in software development

Evolution can be cited as a metaphor for iterative development: every new iteration (or commit at an ideal smallest scale) produces a shippable, new version of the software that has evolved from the previous one.

Really simplifying the process, and skipping some steps for simplicity, we can see something like:

you start from a unicellular organism
which evolves (slowly) into a multicellular organism
which evolves into a fish
which evolves into a mammal such as a primate and then into Homo Sapiens

as a metaphor for:

you start from a Walking Skeleton
you add the features necessary to get to a Minimum Viable Product
you add, modify and drop features to tackle one part of a hierarchical backlog and get to some business objective.
you pick another area of the backlog to continue working

Every metaphor has its limits: for example not all software descends from a common ancestor; variations in the new generations of software are not necessarily produced by randomness nor selected by reproductive ability of the software itself.
Still, we can take some lessons where patterns observed over million of years of evolution apply to a software development scenario.

If by any chance you happen to be a creationist it's better for you to stop reading this post.

Lesson: each step has to keep working

My father is a human, working organism. His father was too - and also the mother of his father. We all descend from an unbroken line of ancestors who were capable of staying alive and, during their lifespans, reproduce. This line goes up until ancestors 4 billion years ago who were unicellular organisms.
Thus evolution has to keep all intermediate steps working. Evolution cannot produce current versions who do not have value in the hope that they can be turned into something valuable later.
Value here is not necessarily money as it can be a user base or market share that can be monetized later: the investors market puts a price tag on a large user base but often not on a sound software architecture.
In fact, this lesson is a reality in capitalism-based companies whose funding is received through demonstrating business results; again, not necessarily profitability but acquisition, retention, growth metrics (Twitter) or revenue (Amazon):

No wonder tech is going the way it's going. People need something tangible in 3 months or no funding http://t.co/f370JW4C9W

— Alvaro 阿尔瓦罗 (@old_sound) March 21, 2015

Short-term capital investments are a very common business model in companies adopting Agile software development. They're not the only possible model to develop wealth: the Internet and GPS were created through generous public funding by the US government (which had its own military reasons).

Lesson: evolution takes a long time

If you take a look at the timeline of evolution on Earth, you'll see that it took a long time for more and more complex organisms to appear:

for the last 3.6 billion years, simple cells (prokaryotes);
for the last 300 million years, reptiles;
for the last 200 million years, mammals;
for the last 130 million years, flowers;
for the last 60 million years, the primates,
for the last 20 million years, the family Hominidae (great apes);
for the last 2.5 million years, the genus Homo (including humans and their predecessors);
for the last 200,000 years, anatomically modern humans.

We can argue that flowers are simpler than dinousaurs, and I'm going to address this later, but we widely believe that modern age humans express the most complex abilities ever seen on Earth: speaking, reading, writing, and tool-building. It can take a long time too for a complex software and an adequate design to emerge purely from iteration.

Lesson: we may not be capable of anything else

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system. – John Gall

Evolution has been wonderful at producing animals that can run while for robotics it was an hard problem to control artificial legs on disparate terrains compared to ordinary computation.
But, if we believe Gall's law, there may not be another way to get to complex systems than to pass from simple systems first. In a sense, robotics engineers have not designed robot dogs from scratch but evolved them from simpler artificial beings at a faster pace than natural selection.
Exercises on evolving a product in thin slices such as the Elephant carpaccio are often made fun of because of the triviality of the slices in a controlled exercise: "add an input field", "support two discounts instead of one". However, in a real environment the slices become more "launch the product in France for this 10 million people segment" and "launch for another segment of the population". To believe we can design the perfect solution and never slice the problem into steps really is a God complex.

Lesson: local minima are traps

Humans and in general vertebrates have variants of the recurrent laryngeal nerve, which connects the brain to the larynx or an equivalent organ. This nerve developed in fishes, where it took a short route around the main artery exiting from the heart. It still takes this route today in humans.
However, giraffes have undergone a series of local optimizations (boy-scout rule anyone?) where the giraffes with longer necks outcompeted the shorter ones, raising the average length of their necks up to what we see today. The nerve grew with the neck as the only possible local modification that keep the new giraffe capable of living and reproducing itself was to lengthen the nerve, not to redesign it.
Longer nerves have a cost: the more cells they are composed of, the easier it is for them to develop diseases or to suffer injuries. An engineer would never do such a design mistake when trying to build a giraffe.
It is open to speculation whether it is possible to develop a giraffe with a nerve taking a shorter route: an engineer would surely try to simplify the DNA with this large refactoring.
But there may be, for example, embryological reasons that prevent giraffes from being able to be grown from an embryo if the nerve takes a different route. We often estimate the time for a new feature as a critical path where nothing goes wrong, but what if you have to take offline and migrate your database for hours in order to deploy this brand new, clean object-oriented design? Your perfectly designed giraffe may be destined to be stillborn.
Nature isn't bad or good: it just is. Local minima are a fact of life and software. We should take care not to preach local improvements as the silver bullet solution, nor to jump into too large refactorings which will kill the giraffe.

Lesson: you don't necessarily get better

http://en.wikipedia.org/wiki/Mosquito#/media/File:Mosquito_2007-2.jpg

Fitness functions describe how well-adapted an organism is to its environment, and evolutionary pressure selects the organisms with the better fitness to survive for the next generations. Even when selective pressure is very slightly skewed in one direction, over many generations a trait may very well disappear or appear due to a cumulative effect.
Depending on your choice for fitness, you may get a different result. Modern-day mosquitoes evolved from the same tree of life as humans, so our evolutionary projects can either become humans, elephants, cats, cockroaches or mosquitoes depending on which direction the forces in play are selecting.
In fact, there are so many legacy code projects around that you wonder what has produced them. One line at the time, they evolved into monsters to satisfy the fitness function of external quality: add this feature, make more money, save server resources, return a response to the user faster.
Worse is better is another example of philosophy where for software is more important to be simple than correct. Simple software is produced and spreads faster than cathedral-like software, taking advantage then of network effects such as the newly created community to improve. We have seen that at work with C and x86 (and so many other standards), we are seeing it now with JavaScript and MongoDB.
Again, I'm not saying this is good or bad: I'm saying this is happening and that if you want to replace JavaScript with a better language you have to take this into account instead of complaining about it. One wonders how many extinct animals we have never heard about, which leads us to the last lesson of evolution.

Lesson: extinction is around the corner

Much of this article is a deconstruction of iterative development, as a way to swing the pendulum on the other side of the argument for once. There has to be a devil's advocate.
I will leave you instead with a final point on why iterative development is a crucial part of Agile. After all, even this post is iterative: written, edited and reviewed at separate times, I didn't write it with a pen and paper in fact but on a digital medium very easy to modify.

More than 99 percent of all species that ever lived on the planet are estimated to be extinct -- Extinction on Wikipedia

Why a species goes extinct? It may evolve into another species, but this happens very slowly. What usually happens is the member of a species are no longer able to survive in changing conditions or against superior competition. Which sounds like something extracted from a management book.

In fact, one of the fundamental maxims of Agile software design is to keep software easy to change. Resist the over-optimization of today to survive and thrive tomorrow, as we can't foresee the future but we can foresee that we will likely have to change.

Invisible to the eye