Tuesday, February 02, 2016

Book review: Java Puzzlers

Java Puzzlers is a nice book on the corner cases of the Java language, taken as an excuse to explain its inner workings. I read it over the holidays and found it a useful refresher on the perils of some of Java's features and types.

What you'll learn

A first example of a feature under scrutiny is the casting of primitive types to each other (byte, char, int), with the related overflows and unintended losses of precision. Getting into one of these conditions can result in an infinite loop, so it's definitely something you want to be able to debug if it happens.
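For instance (a minimal sketch of my own, not a puzzle quoted verbatim from the book), an overflow on a byte counter is enough to loop forever:

    public class ByteLoop {
        public static void main(String[] args) {
            // b wraps from 127 back to -128 on the increment, so the condition never becomes false
            for (byte b = Byte.MIN_VALUE; b <= Byte.MAX_VALUE; b++) {
                // never terminates
            }
        }
    }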
There are more esoteric corner cases in the book, such as the moment in which i == i returns false, or the problem of class and object initialization not happening in the order you expect.
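The i == i surprise, for the record, comes from NaN (again my own sketch, not the book's exact example):

    public class NanEquality {
        public static void main(String[] args) {
            double i = 0.0 / 0.0;        // NaN: the only value that is not equal to itself
            System.out.println(i == i);  // prints false
        }
    }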

Method

I don't want to spoil the book, since it's based on trying to run the examples yourself and proposing an explanation for the observed behavior. I'd say this scientific method is very effective in keeping the reader engaged, to the point that I finished it in a few days.
To run the puzzles, you work on self-contained .java files provided by the book's website; you can actually compile and run them from the terminal as there are no external dependencies.
Incidentally, this isolation also means you're working on the language and in some cases on the standard library; the knowledge you acquire won't be lost after the next release of an MVC framework.
Again, work at a computer and not only on the printed book; otherwise, you won't be able to experiment and you will be forced to read the solutions instead of trying to solve the puzzles yourself.

Argh, Java!

It's easy to fall into another pitfall: bashing the Java language and its libraries. The authors, however, make clear that the goal of the book is just to explain the corner cases of the platform so that programmers can be more productive. Java has been a tremendously successful platform all over the world, the strange behaviors shown here notwithstanding.
Thus instead of saying "this language sucks" you can actually think "if this happens in my production code, I will know what to do". The biggest lesson of the book is to keep the code simple, consistent and readable, so that there are no ticking time bombs or hidden, open manholes on a dark and stormy night.

Some sample quotes

Clearly, the author of this program didn’t think about the order in which the initialization of the Cache class would take place. Unable to decide between eager and lazy initialization, the author tried to do both, resulting in a big mess. Use either eager initialization or lazy initialization, never both.
the lesson of Puzzle 6: If you can’t tell what a program does by looking at it, it probably doesn’t do what you want. Strive for clarity.
Whenever the libraries provide a method that does what you need, use it [EJ Item 30]. Generally speaking, the libraries provide high-quality solutions requiring a minimum of effort.

Conclusions

Some of these pitfalls may be found through testing; some will only be found when the code is heavily exercised in a production environment; some may remain traps never triggered. Yet, having this knowledge will make you able to understand what's happening in those cases instead of just looking for an unpleasant workaround. This is also a recreational book for programmers, so if you're working with Java get to know it with an easy and fun read.

Monday, January 25, 2016

Book review: Java Concurrency In Practice

I just finished reading the monumental book Java Concurrency in Practice, the definitive guide to writing concurrent programs in Java, by Brian Goetz et al. This book gathers lots of information in a single, easy-to-find place, so I'll delve immediately into describing what you can learn from it.

A small distributed system

On modern processor architectures, multithreading and concurrency have in general become a small distributed system inside a motherboard, spanning the centimeters that separate the CPU cores and the RAM.
In fact, you can see many parallels between the two fields: CPU cores behave like different machines, and coordinating between them is relatively more costly than allowing independent executions. The L1, L2 and L3 caches near the CPU cores behave as replicas, showing tunable consistency models and forcing compilers to introduce synchronization where needed.
Moreover, partial failure is always around the corner as threads run independently. Forcing programmers to deal with possible failure is one of the few usages of checked exceptions that I find not only acceptable but also desirable. I don't like checked exceptions very much, as they tend to be replicated in too many places in the code, creating coupling. Still, they make it harder to forget about a possible thread interruption and push for isolating the concurrent code from the domain models it uses underneath, to avoid contaminating them with throws clauses.
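As an illustration of what dealing with interruption looks like in practice, here is a minimal sketch of mine (not an example from the book) of a task that handles InterruptedException instead of swallowing it:

    import java.util.concurrent.TimeUnit;

    public class PoliteWorker implements Runnable {
        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // ... perform a unit of work here ...
                    TimeUnit.MILLISECONDS.sleep(100); // blocking call that declares InterruptedException
                }
            } catch (InterruptedException e) {
                // restore the interrupt status so code further up the stack can see it
                Thread.currentThread().interrupt();
            }
        }
    }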

Relevant JVM topics

The book is full of Java Virtual Machine concurrency concepts, building a pattern language for talking about thread safety and performance (which are the goals we pursue with concurrent applications). Java's model is based on multithreading and shared memory, where JVM threads are mapped 1:1 onto OS threads:
  • Thread safety is based on confinement, atomicity, and visibility. These are not generic terms here but concrete concepts, explained with many code samples (a minimal visibility sketch follows this list).
  • Publication and synchronization make threads communicate, and immutable objects help keep the collaboration simple. Immutability is not just a conceptual suggestion, because the JVM actually behaves differently when final fields are in place.
  • Every concept boils down to an explanation built over the underlying Java Memory Model, a specification that JVMs have to respect when implementing primitive operations.
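To make the visibility point concrete, here is a sketch of my own (not an example from the book): without volatile, the looping thread may legitimately never observe the flag set by another thread.

    public class VisibleFlag {
        // volatile guarantees that a write by one thread becomes visible to readers in other threads
        private volatile boolean stopped = false;

        public void requestStop() {
            stopped = true;
        }

        public void runUntilStopped() {
            while (!stopped) {
                // ... do work; without volatile this loop could legitimately spin forever
            }
        }
    }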

Libraries

Basic concepts are necessary for understanding what's going on in your VM, but they are an insufficient level of abstraction for productive work. For this reason, the book explains the usage of several standard libraries:
  • Synchronized data structures and their higher performance. ConcurrentHashMap is a work of art, as it implements lock striping to avoid coordination when accessing different buckets of the map.
  • The Executor framework provides thread pools, futures, task cancellation and clean shutdown; creating threads by hand is a beginner's solution (a usage sketch follows this list).
  • Middleware such as latches, semaphores, and barriers to coordinate threads and stop them from clashing with each other without having to manually write synchronized blocks all over the place.
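Here is a usage sketch of the Executor framework (my own example, written with Java 8 lambda syntax even though the book predates it):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ExecutorExample {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4); // reuse 4 worker threads
            Future<Integer> answer = pool.submit(() -> {
                // an expensive computation running on a pool thread
                return 42;
            });
            System.out.println(answer.get()); // blocks until the result is available
            pool.shutdown();                  // clean shutdown: no new tasks, queued ones complete
        }
    }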
Thus part of the book has an emphasis on using the best tools available in Java SE instead of reinventing the wheel with Object.wait() and Object.notifyAll(), which are still explained thoroughly in the advanced chapters. Reinventing the wheel can be an error-prone task that produces inferior results, and it should not be the only option just because it's the only approach you know.

Be aware that...

The book is updated to Java 6 (it's missing the Fork/Join framework for example), but fortunately this version contains much of what you need on the theory and basic libraries. You will still need to integrate this knowledge with Java 8 parallel streams.
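For reference, this is the kind of Java 8 construct the book cannot cover (a trivial sketch of mine): a parallel stream distributing work over the common Fork/Join pool.

    import java.util.stream.IntStream;

    public class ParallelStreamExample {
        public static void main(String[] args) {
            // .parallel() splits the range across the common Fork/Join pool threads
            long sumOfSquares = IntStream.rangeClosed(1, 1_000_000)
                    .parallel()
                    .mapToLong(n -> (long) n * n)
                    .sum();
            System.out.println(sumOfSquares);
        }
    }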
It takes focus to get through this book, and I spent several dozen hours reading its 16 chapters.
The annotations (such as @GuardedBy) won't compile if you don't download a separate package; it's too bad they're not a standard, since the authors are luminaries of the Java concurrency field, experts from many JSR groups, and authors of the Java programming language itself.
As always for very technical books, I suggest reading it at a PC, with your preferred IDE and JUnit open to write tests and experiment with what you are learning. You will probably need some review of the most difficult topics, just to hear them explained by different people. Stack Overflow and many blog articles will be your friends as you look for examples of unsafe publication or of the Java Memory Model.

Conclusions

I'm a fan of getting to the bottom of how things work (and don't). I would definitely recommend this book if you are executing your code in multiple threads, as sooner or later you will be bitten without even understanding what went wrong. Even if you're just writing a Servlet, that code will be executed by multiple threads concurrently.
Moreover, as for distributed systems, in concurrency simple testing is not enough: problems can be hard to find and combinatorially difficult to reproduce. You need theory, code review, static analysis: this book is one of the tools that can help you avoid pesky bugs and a lot of wasted time.

Sunday, January 10, 2016

Book review: Thinking in Java

I recently read the 1000-page tome Thinking in Java, written by Bruce Eckel, with the goal of getting my feet wet in the parts of the language that were still obscure to me. Here is my review of the book, containing its strong and weak points.

Basic topics

This is a book that touches on every basic aspect of the Java language, keeping an introductory level throughout but delving into deep usage of the Java standard libraries when treating a specific subject.
The basic concepts are well covered by the first 200 pages:
  • primitive values, classes and objects, control structures, and operators.
  • Access control: private, package, protected, public for classes, methods and fields.
  • Constructors and garbage collection.
  • Polymorphism and interfaces that enable it.
Most of the basic topics are oriented to programmers coming from a C procedural background, so they don't dwell on syntax but instead focus on the semantics and the JVM memory model.
Even if you come from a modern dynamic language, you will still find this first part useful to intimately understand how the language works. You will learn common idioms and patterns such as delegating to parent methods, getting a feel for Java code instead of trying to write Ruby code in a Java environment.

Wide coverage

The larger part of the book will instead be useful to cover new ground, if your knowledge is lacking in some areas or if you want a more complete understanding of the language. For example, I have a good knowledge of data structures such as ArrayList, HashMap and HashSet; still, the 90 pages on the Java collections framework introduced me to structures such as PriorityQueue, whose usage is infrequent but can be very useful when you encounter a problem that calls for them.
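As a reminder of how little ceremony such a structure needs (my own snippet, not one from the book), a PriorityQueue hands back its elements in natural order regardless of insertion order:

    import java.util.PriorityQueue;
    import java.util.Queue;

    public class PriorityQueueDemo {
        public static void main(String[] args) {
            Queue<Integer> queue = new PriorityQueue<>();
            queue.offer(5);
            queue.offer(1);
            queue.offer(3);
            while (!queue.isEmpty()) {
                System.out.println(queue.poll()); // prints 1, 3, 5: smallest element first
            }
        }
    }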
Here is a full list of the specific topics treated in the book:
  • Inner static and non-static classes.
  • Exceptions management with try/catch/finally blocks.
  • Generics and all their advanced cases, including examples such as class SelfBounded<T extends SelfBounded<T>> (see the sketch after this list). I thought I knew generics until I read this chapter.
  • The peculiarities of arrays, still present in some parts of the language such as method calls with a variable number of arguments.
  • The Java collections framework, much more than List, Set and Map; lots of different implementations and a conceptual map of all the interfaces and classes.
  • Input/Output at the byte and text level, plus part of the java.nio evolution.
  • Enumerated types.
  • Reflection and annotations (definition and usage).
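For the curious, this is the kind of thing self-bounded generics buy you; the following is my own illustration of the pattern, not an example taken from the book:

    // T is constrained to be a subclass of Builder<T> itself, so methods defined here
    // can return the concrete subtype and keep a fluent chain fully typed.
    abstract class Builder<T extends Builder<T>> {
        private String name;

        @SuppressWarnings("unchecked")
        public T named(String name) {
            this.name = name;
            return (T) this; // safe as long as subclasses follow the X extends Builder<X> convention
        }

        public String name() {
            return name;
        }
    }

    class HouseBuilder extends Builder<HouseBuilder> {
        public HouseBuilder withRooms(int rooms) {
            return this; // subtype-specific fluent method
        }
    }

    public class SelfBoundedDemo {
        public static void main(String[] args) {
            // named() returns HouseBuilder, not Builder, so withRooms() is still reachable
            HouseBuilder builder = new HouseBuilder().named("cottage").withRooms(3);
            System.out.println(builder.name());
        }
    }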
By the way, a few of the chapters can safely be skipped to make the book shorter. Drop the graphical user interfaces chapter as totally outdated: I don't even write user interfaces without HTML anymore nowadays, and probably the most popular GUI framework right now is the Android platform rather than what's described here.
I also suggest skipping the concurrency and threading chapter, since such a small treatment cannot do justice to this topic. I would prefer a dedicated introduction elsewhere, followed by a more advanced book like Java Concurrency in Practice, which will also tell you what not to do instead of just showing all the language features.
On this point, I find the writing of Bruce Eckel conservative, showing caution with advanced and obscure features rather than showing off with the risk of writing unmaintainable code down the line. The point is making you able to read complex Java code, not enabling you to write a mess more quickly.

Style

The book is quite lengthy, but lets you select a subset of the chapters pretty well if you need to dig into a particular topic. The text is driven by code samples and by experimenting, instead of reciting theory.
I suggest reading a digital version with your IDE ready: at least in my case, I found it easier to pick up concepts and get involved by writing my own examples. A 1000-page book would be pretty daunting if read on a device with no interaction, as it's the polar opposite of dense.
I created many test cases like this one, which led me to verify the assumed behavior of Java libraries and features with my own hands:
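The embedded example from the original post is not reproduced here; a hypothetical test in the same spirit, with made-up names, might look like this:

    import org.junit.Test;
    import static org.junit.Assert.*;

    public class LearningTest {
        @Test
        public void smallIntegersAreCachedWhenAutoboxed() {
            // the JLS guarantees that values in [-128, 127] are cached by Integer.valueOf()
            assertSame(Integer.valueOf(127), Integer.valueOf(127));
            // equals() holds for any value, whether or not the two objects are the same instance
            assertEquals(Integer.valueOf(128), Integer.valueOf(128));
        }
    }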

Currency

The drawback of this book is that it is not up-to-date with the current version of the platform, Java 8. The most recent version is the 4th edition, available on Amazon since 2006, which treats every feature up to Java 5.
You will have to piece together knowledge of Java 8 and the intermediate versions from other sources. I would have expected this book to at least be up-to-date with Java 7 due to its popularity.
However, due to Java's backward compatibility, what you read is still correct: I only found one code sample with a compilation problem. I wonder how long this book would be if it were edited again to include Java 8: it could probably get to 1500 pages or more and implode under its own weight.

Conclusions

If you work with Java, Thinking in Java is a must-read, either to get a quick introduction to the basic features or to delve into one of the specific areas when you need it. You will probably never be surprised by reading Java syntax or idioms again. However, don't expect a complete coverage of such a huge world: this should be your first Java book, not the last.

Monday, January 04, 2016

PHPUnit_Selenium 2.0.0 is out

Here is the text of the change I have just merged to make a new major version of PHPUnit_Selenium a reality:
As signaled in #351, there are incompatibilities between the current version of PHPUnit_Selenium and PHPUnit 5.
It is a losing proposition to keep supporting the Selenium 1 API (SeleniumTestCase), as it redefines methods whose signatures have even changed. It has not been maintained for years.
So to support PHPUnit 5.x there will be a new major version of this project, 2.x. The old 1.x branch will remain available but not updated anymore.
2.x will contain:
  • Selenium2TestCase
and work with PHPUnit 4.x or 5.x, with the corresponding PHP versions.
1.x will contain:
  • SeleniumTestCase
  • Selenium2TestCase
but will only work with PHPUnit 4.x, with the corresponding PHP versions. In general, it will not be updated anymore.
Supported PHP versions vary from 5.3 (!) to 5.6, according to PHPUnit's version requirements.
Installation is available through Composer, as before.

Saturday, January 02, 2016

Building an application with modern Java technologies

Sometimes Java gets a bad rap from Agile software developers, who suspect it to be a legacy technology on par with Cobol. I understand that being the most used language on the planet means there must be projects of any kind out there, including lots of legacy. That said, Java and the JVM are a huge, living ecosystem producing value in all kinds of domains, from banking to Android applications and scientific software.

So I built a sample Game Of Life with what I believe are modern Java technologies:
  • Java 8 with lambdas support
  • Gradle for build automation and dependency resolution (replacing both Ant and Maven)
  • Jetty as an embedded server to respond to HTTP requests (a minimal sketch follows this list)
  • Jersey for building the RESTful web service calculating new generations of a plane, using JAX-RS
  • Freemarker templating engine to build HTML
  • Log4j 2 for logging, encapsulated behind the slf4j interface.
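To give a flavor of the embedded-server approach, a minimal Jetty bootstrap might look like the sketch below (my own simplified example, assuming Jetty 9's AbstractHandler API; the real project wires Jersey in as well):

    import org.eclipse.jetty.server.Request;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.AbstractHandler;

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import java.io.IOException;

    public class EmbeddedServer {
        public static void main(String[] args) throws Exception {
            Server server = new Server(8080); // no external servlet container needed
            server.setHandler(new AbstractHandler() {
                @Override
                public void handle(String target, Request baseRequest,
                                   HttpServletRequest request, HttpServletResponse response)
                        throws IOException {
                    response.setContentType("text/plain;charset=utf-8");
                    response.getWriter().println("It's alive!");
                    baseRequest.setHandled(true);
                }
            });
            server.start(); // starts the connector threads
            server.join();  // blocks the main thread until the server is stopped
        }
    }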
On the testing side of things:
Here's the result:
The application is self-contained and can be compiled, tested and run without any IDE or prior machine setup other than a JDK 8. See the project on Github for details.

Thursday, June 18, 2015

Property-based testing primer

I'm a great advocate of automated testing and of finding out your code does not work on your machine, 30 seconds after having written it, instead of in production after it has caused a monetary loss and some repair work to be performed. This is true for many different kinds of testing, from the unit level (which also has benefits for internal quality and design feedback) to the acceptance level (which ensures the stakeholders get what they need and documents it for the future). Your System Under Test can be a single class, a project or even a collaboration of [micro]services accessed through HTTP from another process or machine.

However, classic test suites written in xUnit and BDD styles hit some scaling problems when you want to exercise more than some happy paths:
  • it is difficult to cover many different inputs by hand-writing test cases, so we stick with at most a dozen cases for a particular method.
  • There are maintenance costs for every new input we want to test: each needs assertions to be written, and updated in the future if the System Under Test changes its API.
  • We tend not to test external dependencies such as the language or libraries, since we trust them to do a good job even if their failure is our responsibility. We chose them and we are deploying our project, not the original authors, who provided the code "as is", without warranty.
Note that here I define input as the existing state of a system, plus the new input provided by a test (the Given and When part of a scenario) and output as not only the actual response produced by the System Under Test but also the new state it has assumed.

sort()


Let's take as an example a sort() function, which no one implements today except in job interviews and exercises.
Assuming an array (or list, depending on your language), we can produce several inputs for the function like we would do in a kata:
  • [1, 2, 3]
  • [3, 1]
  • [3, 6, 5, 1, 4]
and so on. When do we stop? Maybe we also need some tricky input:
  • []
  • [1]
  • [1, 1]
  • [2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9]
Once we have all the inputs gathered, we need to define what we expect for each of them:
  • [1, 2, 3] => [1, 2, 3]
  • [3, 1] => [1, 3]
  • [3, 6, 5, 1, 4] => [1, 3, 4, 5, 6]
  • [] => []
  • [1] => [1]
  • [1, 1] => [1, 1]
  • [2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9]  => [1, 2, 3, 3, 5, 6, 6, 7, 8, 8, 9, 9]
You can do this incrementally, growing the code one new test case at a time, but you have to do it anyway. Considering how boring and error-prone this can become in a toy example makes you wonder what to do when, instead of sort(), you have a RangeOfMoney class which is key to your business and manages point-like and arbitrary intervals of monetary amounts (true story).

Property-based testing in a nutshell

Property-based testing is an approach to testing coming from the functional programming world. To solve the aforementioned problems (and get new, more interesting ones), it follows these steps:
  1. Generate a random sample of possible inputs.
  2. Exercise the SUT with each of them.
  3. Verify properties that should hold on every output, instead of making precise comparisons.
  4. (Optionally) if the property verification fails, shrink the input to find a minimal one that still causes a failure.

How does this work for the sort() function?

We can use random generation to produce an input array composed of natural numbers (Gen\nat) and up to 100 elements long (Gen\pos(100)), since very long arrays could make our tests slow.

Then, for each of these inputs, we exercise sort() and verify a simple property on the output, which is the ordering of the elements.
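The original snippet uses Eris in PHP and is not reproduced here; as a language-agnostic illustration of the same idea, a hand-rolled version in Java might look like this (names and bounds are mine):

    import java.util.Arrays;
    import java.util.Random;

    public class SortPropertyCheck {
        public static void main(String[] args) {
            Random random = new Random();
            for (int run = 0; run < 1000; run++) {
                // generate a random input: up to 100 natural numbers
                int[] input = random.ints(random.nextInt(101), 0, 1000).toArray();

                int[] output = input.clone();
                Arrays.sort(output); // the System Under Test

                // property: every element is <= the one that follows it
                for (int i = 0; i < output.length - 1; i++) {
                    if (output[i] > output[i + 1]) {
                        throw new AssertionError("Not sorted: " + Arrays.toString(output));
                    }
                }
            }
        }
    }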
This is not the only property that sort() maintains, but it's the first I would specify. There are possible others:
  • every element in the input is also in the output
  • every element in the output is also in the input
  • the length of the input and output arrays are the same.
Here is the complete example written using Eris, our extension for PHPUnit that automates the generation of many kinds of random data and their verification.

How to find properties?

How do we apply property-based testing to code we actually write every day? It certainly fits more in some areas of the code than in others, such as Domain Model classes.

Some rules of thumb for defining properties are:
  • look for inverse functions (e.g. addition and subtraction, or doubling an image in size and shrinking it to 50%). You can use the inverse on the output and verify equality with the input.
  • Relate input and output through some property that is true or false on both (e.g. in the sort() example, that an element that is in one of the two arrays is also in the other).
  • Define post-conditions and invariants that always hold in a particular situation (e.g. in the sort() example, that the output is sorted; in general you can restrict the possible output values of a function very much by saying it is an array, it contains only integers, and its length is equal to the input's length).

[2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9] makes my test fail

Defining the valid range of inputs with generators and the properties to be satisfied is a rich description of the behavior of the System Under Test. Therefore, when a sort() implementation fails, we can work on the input in order to shrink it: trying to reduce its complexity and size to provide a minimal failing test case.

It's the same work we do when opening a bug report for someone else's code: we try to find a minimal combination that triggers the bug in order to throw away all unnecessary details that would slow down fixing it.

So in property-based testing the [2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8, 9] can probably be shrunk to [2, 3, 5, 6, 8, 9, 1, 3, 6, 7, 8] and maybe down to [1, 0], depending on the bug. This process is accomplished by trying to shrink all the random values generated, which in our case were the length of the array and the values contained in it.

Testing the language

So here's some code I expect to work: a function that creates a PHP DateTime instance using the native datetime extension, which is a standard for the PHP world. It starts from a year and a day number ranging from 0 to 364 (or 365), and builds a DateTime pointing to midnight of that particular day.

Here is a property-based test for this function: we generate two random integers in the [0, 364] range, and test that the difference in seconds between the two generated DateTime objects is equal to 86400 seconds multiplied by the number of days between the two selected dates. A property of the input (distance) is maintained over the output in a different form (seconds instead of days).

Surprisingly, this test fails: what happened is that we triggered a bug in the DateTime object while creating it with a particular combination of format and timezone. The net effect of this bug could have been that our financial reports (showing daily revenue) would have started displaying the wrong numbers from February 29th of the next year.

Notice that the input is shrunk to the simplest possible values that trigger the problem: January 1st for one value and March 1st for the other.
Eventually we found an easy workaround, as with a couple more lines of code we can avoid this behavior. We could do that only after discovering the bug, of course.

In conclusion

Testing an application is a necessary burden for catching defects early and fixing them at an acceptable cost, instead of letting them run wild on real users. Property-based testing pushes automation also into the generation of inputs for the System Under Test and into the verification of results, hoping to lower the maintenance cost while increasing coverage at the same time.

Given the domain complexity handled by the datetime extension, it's doing a fantastic job and it's being developed by very competent programmers. Nevertheless, if they can slip in bugs I trust that my own code will, too. Property-based testing is an additional tool that can work side by side with example-based testing to uncover problems in our projects.

We named the property-based PHPUnit extension after Eris, the Greek goddess of chaos, since serious testing means attacking your code and the platform it is built on in an attempt to break it before someone else does.


Wednesday, May 27, 2015

Eris 0.4.0 is out

Eris is a PHPUnit extension for property-based testing, that is to say testing based on generating inputs to a system and checking that its state and output respect a set of properties. The project is a port of QuickCheck, and Eris is the name of the Greek goddess of chaos, since its aim is to break the System Under Test with unexpected inputs and actions.

I am planning a talk at the PHP User Group Milano and a longer blog post to introduce the general public to how property-based testing works. I held the same talk for a few friends at the phpDay 2015 Unconference.

Meanwhile version 0.4.0 is out, with the following ChangeLog:
  • Showing generated input with ERIS_ORIGINAL_INPUT=1.
  • New Generators: names and date (DateTime).
  • tuple Generator supports variadic arguments.
  • Shrinking respects when() clauses.
  • Dates and sorting examples.

As for all semantic-versioned projects, the 0.x series should be considered alpha and no API backward compatibility is guaranteed on update.


Sunday, March 29, 2015

Evolution in software development

Evolution can be cited as a metaphor for iterative development: every new iteration (or commit at an ideal smallest scale) produces a shippable, new version of the software that has evolved from the previous one.

Really simplifying the process, and skipping some steps, we can see something like:
  1. you start from a unicellular organism
  2. which evolves (slowly) into a multicellular organism
  3. which evolves into a fish
  4. which evolves into a mammal such as a primate and then into Homo Sapiens
as a metaphor for:
  1. you start from a Walking Skeleton
  2. you add the features necessary to get to a Minimum Viable Product
  3. you add, modify and drop features to tackle one part of a hierarchical backlog and get to some business objective.
  4. you pick another area of the backlog to continue working on.
Every metaphor has its limits: for example not all software descends from a common ancestor; variations in the new generations of software are not necessarily produced by randomness nor selected by reproductive ability of the software itself.
Still, we can take some lessons where patterns observed over millions of years of evolution apply to a software development scenario.

If by any chance you happen to be a creationist it's better for you to stop reading this post.

Lesson: each step has to keep working

My father is a human, working organism. His father was too - and also the mother of his father. We all descend from an unbroken line of ancestors who were capable of staying alive and, during their lifespans, reproducing. This line goes back to ancestors of 4 billion years ago, who were unicellular organisms.
Thus evolution has to keep all intermediate steps working. Evolution cannot produce current versions that do not have value in the hope that they can be turned into something valuable later.
Value here is not necessarily money as it can be a user base or market share that can be monetized later: the investors market puts a price tag on a large user base but often not on a sound software architecture.
In fact, this lesson is a reality in capitalism-based companies whose funding is received through demonstrating business results; again, not necessarily profitability but acquisition, retention, growth metrics (Twitter) or revenue (Amazon):
Short-term capital investments are a very common business model in companies adopting Agile software development. They're not the only possible model to develop wealth: the Internet and GPS were created through generous public funding by the US government (which had its own military reasons).

Lesson: evolution takes a long time

If you take a look at the timeline of evolution on Earth, you'll see that it took a long time for more and more complex organisms to appear:
  • for the last 3.6 billion years, simple cells (prokaryotes);
  • for the last 300 million years, reptiles;
  • for the last 200 million years, mammals;
  • for the last 130 million years, flowers;
  • for the last 60 million years, the primates,
  • for the last 20 million years, the family Hominidae (great apes);
  • for the last 2.5 million years, the genus Homo (including humans and their predecessors);
  • for the last 200,000 years, anatomically modern humans.
We can argue that flowers are simpler than dinosaurs, and I'm going to address this later, but it is widely believed that modern humans express the most complex abilities ever seen on Earth: speaking, reading, writing, and tool-building. It can likewise take a long time for complex software and an adequate design to emerge purely from iteration.

Lesson: we may not be capable of anything else

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system. – John Gall
Evolution has been wonderful at producing animals that can run, while for robotics controlling artificial legs on disparate terrains has been a hard problem compared to ordinary computation.
But, if we believe Gall's law, there may not be another way to get to complex systems than to pass from simple systems first. In a sense, robotics engineers have not designed robot dogs from scratch but evolved them from simpler artificial beings at a faster pace than natural selection.
Exercises on evolving a product in thin slices such as the Elephant carpaccio are often made fun of because of the triviality of the slices in a controlled exercise: "add an input field", "support two discounts instead of one". However, in a real environment the slices become more "launch the product in France for this 10 million people segment" and "launch for another segment of the population". To believe we can design the perfect solution and never slice the problem into steps really is a God complex.

Lesson: local minima are traps

Humans and in general vertebrates have variants of the recurrent laryngeal nerve, which connects the brain to the larynx or an equivalent organ. This nerve developed in fishes, where it took a short route around the main artery exiting from the heart. It still takes this route today in humans.
However, giraffes have undergone a series of local optimizations (boy-scout rule anyone?) where the giraffes with longer necks outcompeted the shorter ones, raising the average length of their necks up to what we see today. The nerve grew with the neck, as the only possible local modification that kept the new giraffe capable of living and reproducing was to lengthen the nerve, not to redesign it.
Longer nerves have a cost: the more cells they are composed of, the easier it is for them to develop diseases or suffer injuries. An engineer would never make such a design mistake when trying to build a giraffe.
It is open to speculation whether it is possible to develop a giraffe with a nerve taking a shorter route: an engineer would surely try to simplify the DNA with this large refactoring.
But there may be, for example, embryological reasons that prevent giraffes from being grown from an embryo if the nerve takes a different route. We often estimate the time for a new feature as a critical path where nothing goes wrong, but what if you have to take your database offline and migrate it for hours in order to deploy this brand new, clean object-oriented design? Your perfectly designed giraffe may be destined to be stillborn.
Nature isn't bad or good: it just is. Local minima are a fact of life and software. We should take care not to preach local improvements as the silver bullet solution, nor to jump into too large refactorings which will kill the giraffe.

Lesson: you don't necessarily get better

(Image: a mosquito, from http://en.wikipedia.org/wiki/Mosquito#/media/File:Mosquito_2007-2.jpg)

Fitness functions describe how well-adapted an organism is to its environment, and evolutionary pressure selects the organisms with the better fitness to survive for the next generations. Even when selective pressure is very slightly skewed in one direction, over many generations a trait may very well disappear or appear due to a cumulative effect.
Depending on your choice for fitness, you may get a different result. Modern-day mosquitoes evolved from the same tree of life as humans, so our evolutionary projects can either become humans, elephants, cats, cockroaches or mosquitoes depending on which direction the forces in play are selecting.
In fact, there are so many legacy code projects around that you wonder what has produced them. One line at a time, they evolved into monsters to satisfy the fitness function of external quality: add this feature, make more money, save server resources, return a response to the user faster.
Worse is better is another example of a philosophy in which it is more important for software to be simple than correct. Simple software is produced and spreads faster than cathedral-like software, then takes advantage of network effects such as the newly created community to improve. We have seen that at work with C and x86 (and so many other standards), and we are seeing it now with JavaScript and MongoDB.
Again, I'm not saying this is good or bad: I'm saying this is happening and that if you want to replace JavaScript with a better language you have to take this into account instead of complaining about it. One wonders how many extinct animals we have never heard about, which leads us to the last lesson of evolution.

Lesson: extinction is around the corner

Much of this article is a deconstruction of iterative development, as a way to swing the pendulum on the other side of the argument for once. There has to be a devil's advocate.
I will leave you instead with a final point on why iterative development is a crucial part of Agile. After all, even this post is iterative: written, edited and reviewed at separate times; in fact, I didn't write it with pen and paper but on a digital medium that is very easy to modify.
More than 99 percent of all species that ever lived on the planet are estimated to be extinct -- Extinction on Wikipedia

Why does a species go extinct? It may evolve into another species, but this happens very slowly. What usually happens is that the members of a species are no longer able to survive in changing conditions or against superior competition. Which sounds like something extracted from a management book.

In fact, one of the fundamental maxims of Agile software design is to keep software easy to change. Resist the over-optimization of today to survive and thrive tomorrow, as we can't foresee the future but we can foresee that we will likely have to change.

Sunday, September 21, 2014

Microservices are not Jars

The cult of the monolith
I've been building microservices for two years and my main complaint is that they're still not micro enough. Here's a rebuttal of Uncle Bob's recent post Microservices and Jars, which he apparently wrote after forming an opinion based on an article in Martin Fowler's bliki:
One of my clients recently told me that they were investigating a micro-service-architecture. My first reaction was: "What's that?" So I did a little research and found Martin Fowler's and James Lewis' writeup on the topic.
 "I didn't even know what microservices were up until several days ago. Now I'm ready to pontificate about the topic."
So what is a micro-service? It's just a little stand-alone executable that communicates with other stand-alone executables through some kind of mailbox; like an http socket. Lots of people like to use REST as the message format between them.
Why is this desirable? Two words. Independent Deployability.
Let's ignore the "REST as the message format" terminology. Only two words? Independent deployability is nice, but I've seen cases where independence is total, and cases where an end-to-end test suite still needs to run including the production version of services A and B and the new version C' that we want to deploy to substitute C.
Other interesting properties of microservices such as scaling them independently come to mind. Or writing them in different languages. Or adapting to Conway's law by aligning teams with microservices for most of their work.
You can fire up your little MS and talk with it via REST. No other part of the system needs to be running. Nobody can change a source file in a different part of the system, and screw your little MS up. Your MS is immune to all the other code out there.
You can test your MS with simple REST commands; and you can mock out other MSs in the system with little dummy MSs that do just what your tests need them to do.
Moreover, you can control the deployment. You don't have to coordinate with some huge deployment effort, and merge deployment commands into nasty deployment scripts. You just fire up your little MS and make sure it keeps running.
You can use your own database. You can use your own webserver. You can use any language you like. You can use any framework you like.
Freedom! Freedom!
<sarcasm> tag needed.

But wait. Why is this better? Are the advantages I just listed absent from a normal Java, or Ruby, or .Net system?
Yes:
  • existing databases tend to be attractors when new persistence requirements come up. So if I have MySQL up and running in my application and a job that would be a good fit for MongoDB comes up, I'm definitely not going to introduce MongoDB given the infrastructure setup time. I'll just go with the existing infrastructure and create some new tables, perpetuating the growth of the monolith.
  • Web servers are often tied to languages. If I want to use Node.js it will listen on port 80 by itself, while PHP is commonly used with Apache, and Java with Tomcat or Jetty.
  • JARs are a pretty JVM-specific packaging system. I'm definitely not going to put PHP code into JARs.
  • Frameworks come from the language, and even inside the same language I can have multiple PHP applications where one has a custom user interface and one serves an Angular single-page application.
Also the ones not listed:
  • It's easier to find the machines that contain bottlenecks and replace them, since CPU and IO usage map directly to applications.
  • It's easier to get started working as a new developer because you need just a single microservice to run on your machine.
  • It's easier to throw away one microservice and replace it with a new one doing the same job, but better written.
What about: Independent Deployability?
We have these things called jar files. Or Gems. Or DLLs. Or Shared Libraries. The reason we have these things is so we can have independent deployability.
Replacing single JARs or DLLs seems pretty dangerous to me where there are compile-time and binary dependencies in play. Since Uncle Bob has experience with that, I'm going to trust him to deploy safely this way.
Most people have forgotten this. Most people think that jar files are just convenient little folders that they can jam their classes into any way they see fit. They forget that a jar, or a DLL, or a Gem, or a shared library, is loaded and linked at runtime. Indeed, DLL stands for Dynamically Linked Library.
So if you design your jars well, you can make them just as independently deployable as a MS. Your team can be responsible for your jar file. Your team can deploy your DLL without massive coordination with other teams. Your team can test your GEM by writing unit tests and mocking out all the other Gems that it communicates with. You can write a jar in Java or Scala, or Clojure, or JRuby, or any other JVM compatible language. You can even use your own database and webserver if you like.
You can use any language you like as long as you run it on the JVM. Surely there must be people who work on other infrastructure or don't want to run their language on a compatible-yet-really-secondary platform? PHP applications? Ruby programmers?
If you'd like proof that jars can be independently deployable, just look at the plugins you use for your editor or IDE. They are deployed entirely independently of their host! And often these plugins are nothing more than simple jar files.
So what have you gained by taking your jar file and putting it behind a socket and communicating with REST?
SOAP is the last acronym in which "simple" was used this way. Look, by generating a WSDL from your objects, along with an XSD file that can be used to validate XML messages, you can pass requests over HTTP with a SOAPAction header and regenerate Java (or other compatible languages) code from the WSDL...
One thing you lose is time. It takes time to communicate through a socket. It takes time to decode REST messages. And that time means you cannot use micro-services with the impunity of a jar. If I want two jars to get into a rapid chat with each other, I can. But I don't dare do that with a MS because the communication time will kill me.
Of course, chatty fine-grained interfaces are not a microservices trait. I prefer "accept a Command, emit Events" as an integration style. After all, microservices can become dangerous if integrated with purely synchronous calls, so the kind of interfaces they expose to each other is necessarily different from that of objects working in the same process. This is a property of every distributed system, as we know from 1996.
On my laptop it takes 50ms to set up a socket connection, and then about 3us per byte transmitted through that connection. And that's all in a single process on a single machine. Imagine the cost when the connection is over the wire!
It takes more time to write a file line by line than to do it in a single shot. However, if the file is 2GB long, I prefer the first solution in order to preserve memory. I'm just trading off time for another resource.
In the case of microservices, I'm trading off the latency of single interactions between different services for more important resources: programmer time, independent scalability, even time experienced by the end user. A front end asynchronously publishing events to a backend service feels faster to the user than a monolithic application where I respond to user requests and generate report lines in the same process or on the same machines.
Another thing you lose (and I hate to say this) is debuggability. You can't single step across a REST call, but you can single step across jar files. You can't follow a stack trace across a REST call. Exceptions get lost across a REST interface.
To me, debuggability and introspection into an application improve when using microservices, because you will have the HTTP logs of every service calling one another. You don't have to set up logging cut points in advance, as they come for free with the HttpChannel objects. For a more business-oriented monitoring, take a look at Domain Events: we publish them from different applications in order to build reports based on data from different components.
After reading this you might think I'm totally against the whole notion of Micro-Services. But, of course, I'm not. I've built applications that way in the past, and I'll likely build them that way in the future. It's just that I don't want to see a big fad tearing through the industry leaving lots of broken systems in its wake.
For most systems the independent deployability of jar files (or DLLS, or Gems, or Shared Libraries) is more than adequate. For most systems the cost of communicating over sockets using REST is quite restrictive; and will force uncomfortable trade-offs.
Paraphrasing Stroustrup, there are only two kinds of architectures: the ones people complain about and the ones nobody uses. We are here proposing microservices because they have provided value in many systems that were once thought not to need them. As long as you have reporting needs you don't want to burden your front end with, or need to scale up in the number of users or programmers, you can consider microservices (and their cost).
My advice:
Don't leap into microservices just because it sounds cool. Segregate the system into jars using a plugin architecture first. If that's not sufficient, then consider introducing service boundaries at strategic points.
Please don't! The interactions between microservices are very different from the ones between objects inside a single application. Each call outside of the boundary is a potential failure mode that you should try to model as an asynchronous message that can be retried when delivery fails (the receiving microservice being down, slow or unreachable). Retrofitting microservices onto an existing code base is a costly endeavour and you should only embark on it if you have an adequate time and money budget, possibly bigger than the one necessary to build with microservices in the first place.

Sunday, August 17, 2014

Tabular data in Behat

All of this has happened before, and all this will happen again. -- BSG
I just watched Steve Freeman's short talk "Given When Then" considered harmful (requires free login), and I was looking for some ways to cheaply eliminate duplication in Behat scenarios.

Fortunately, Behat supports Scenario Outlines for tabular data which is an 80/20 solution to transform lots of duplicated scenarios:
    Scenario: 3 is Fizz
        Given the input is 3
        When it is converted
        Then it becomes Fizz

    Scenario: 6 is Fizz too because it's multiple of 3
        Given the input is 6
        When it is converted
        Then it becomes Fizz

    Scenario: 2 is itself
        Given the input is 2
        When it is converted
        Then it becomes 2
into a table:
    Scenario Outline: conversion of numbers
        Given the input is <input>
        When it is converted
        Then it becomes <output>

        Examples:
            | input | output |
            | 2     | 2      |
            | 3     | Fizz   |
            | 6     | Fizz   |

Moreover, you can also pass tabular data to a single step with Table Nodes:
    Scenario: two items in the cart
        Given the following items are in the cart:
            | name    | price |
            | Cake    |     4 |
            | Shrimps |    10 |
        When I check out
        Then I pay 14
It takes a few minutes to learn how to do this in an existing Behat infrastructure. There are minimal changes to perform in the FeatureContext in the case of Table Nodes, while Scenario Outlines are a pure Gherkin-side refactoring.

My code is on Github in my behat-tables-kata repository. If this reminds you of PHPUnit's @dataProvider, try to think of other patterns that can be borrowed from the xUnit world to fast-forward Cucumber and Behat development.
