Invisible to the eye (updated 2024-03-05). A journey in web development, [computer] science, engineering:<br>
getting to know what lies under the hood <em>-- Giorgio Sironi</em>

I just want to run a container... (2023-05-11)

<p> ...is a developer-centric point of view?</p><p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipDsBf2pmANoen3jtGf8iHUAh8V98WMHJH-ZfeqyVBBaS-PYdN6gZcj5uuvArjEhHAZs11cUbXeWpatkgYc05kFM2lb_S_TFwi416C9idaPKWLkjtNDkYxFmw21E6Wvgiu9FTDtv2nkjRTQIFHbH89LO1_4oGbAe53C5INjYq9pTh_XePiovc/s4408/imgix-klWUhr-wPJ8-unsplash.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="2628" data-original-width="4408" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipDsBf2pmANoen3jtGf8iHUAh8V98WMHJH-ZfeqyVBBaS-PYdN6gZcj5uuvArjEhHAZs11cUbXeWpatkgYc05kFM2lb_S_TFwi416C9idaPKWLkjtNDkYxFmw21E6Wvgiu9FTDtv2nkjRTQIFHbH89LO1_4oGbAe53C5INjYq9pTh_XePiovc/s320/imgix-klWUhr-wPJ8-unsplash.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Generic data center image to exemplify a place many of us have never set foot in.<br /></td></tr></tbody></table></p><p>After a few hours spent upgrading the toolchain that starts from Terraform, goes through a few AWS-maintained modules and reaches the Elastic Kubernetes Service APIs, my team was entertaining the thought of how difficult it is to set up infrastructure. 
Infrastructure that, for this use case, takes a container image of a web application that Docker generated and runs it somewhere for users to access.</p><p>Thinking through this, either Kubernetes is the new IBM (no one ever gets fired for choosing it), or there is more to a production web application than running a container, which is what Kubernetes is often sold as: a tool to run containers in the cloud without the necessity to set up specific virtual machines, treating them instead as anonymous, expendable nodes where the aforementioned containers can be tightly packed, sharing CPU, memory and other resources.</p><p>What exactly is the infrastructure doing for us here? There are a few concerns about operations, the other side of that magic DevOps collaboration. For example:</p><ul style="text-align: left;"><li>the container image is exposing an HTTP service on port 80. This is not useful for a modern browser and its padlock icons, so to achieve a <b>secure connection</b> a combination of DNS and Let's Encrypt generates and automatically renews certificates after verifying proof of ownership of our domain name.</li><li>the container image produces <b>logs</b> in the form of JSON lines. Through part of the Grafana stack, these lines are annotated with a timestamp, the container that generated them and various labels such as the environment or the name of a component of the application (think <i>prod</i>, <i>staging</i> or <i>frontend</i>). After these lines are indexed, further software provides the capability to query these logs, zooming in on problems or prioritizing errors during investigation.</li><li>if we get too many of these logs at the error level for a particular message, we'd like to receive an <b>alert</b> email that triggers us into action. <br /></li><li>invariably applications benefit from <b>scheduled processes</b> that run periodically and independently of the request/response lifecycle. 
It's such a useful architectural pattern that Kubernetes even <a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/">named a resource</a> after the <i>cron</i> daemon, introduced in 1975.</li><li>it's also very useful for applications to maintain state and store new rows in a relational <b>database</b>. Provided turnkey by every cloud, a set of hostname, username and password allows access from any combination of programming language and library. No need to worry about rotating the logs of Postgres anymore, or that we are running out of space.</li><li>the data contained in the application and generated by users also lends itself to timely analysis, so that we know whether to kill a feature or to invest in it. This means somehow taking recent updates (or whole tables) out into a <b>data pipeline</b> that transforms them into something that can be analyzed by a data scientist. All at an acceptable speed.</li><li>Those <b>credentials</b> nevertheless need to be provided to the application itself, hopefully without accidentally disclosing them in chats, logs, or emails. No human should need to enter them into an interface.</li><li><b>Configuration</b> is slightly easier to manage than credentials due to the lack of sensitive values, but we'd still like the capability to generate some of these values, such as a canonical URL, or to pass in different numbers for different environments.<br /></li><li>And of course, whenever we commit, we'd like to <b>deploy</b> the change and start the latest version of our application, check that it is responding correctly to requests, and then stop the old, still running one.</li></ul><p>This is all in the context of a single team, <strike>running</strike> operating a single product. 
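To make the logs bullet concrete: the application's side of the contract is little more than printing one JSON object per line, while everything else (collection, indexing, querying, alerting) is layered on by the infrastructure. Here is a minimal TypeScript sketch; the field names such as <i>level</i> and <i>component</i> are illustrative assumptions, not the exact schema of our stack.

```typescript
// A minimal sketch of JSON-lines logging as described above.
// Field names (timestamp, level, component, environment) are illustrative
// assumptions, not the exact schema a Grafana-style stack expects.
type LogLabels = { component?: string; environment?: string };

function logLine(
  level: "info" | "error",
  message: string,
  labels: LogLabels = {}
): string {
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...labels,
  });
  // One self-contained JSON object per line: trivial for the application,
  // ready for the indexing and querying layers to do the heavy lifting.
  console.log(line);
  return line;
}

logLine("error", "payment failed", { component: "frontend", environment: "prod" });
```

The asymmetry is the point: each bullet costs the application a few lines like these, while the collecting, securing and alerting half is what the surrounding infrastructure exists for.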
Part of this is achieved through running off-the-shelf software on top of Kubernetes; part of it by paid-for cloud services; part of it by outsourcing to specialized software as a service.</p><p>However, when we think about software development we don't necessarily think about software operations. Many of us started our careers by writing code, testing and releasing it without a further look at what happened afterwards. DevOps as a philosophy meant bridging the gaps; <a href="https://www.equalexperts.com/blog/our-thinking/what-is-you-build-it-you-run-it/">You Build It You Run It</a> is the easiest summary for me.<br /></p><p>Yet what I see and what I hear from experienced people is that silos in infrastructure seriously slow teams down, and create fears of insurmountable problems. Abandoned projects, "finished" from a development point of view, but now with unclear ownership, running in production and supporting the main revenue stream of an organization.<br /></p><p>So I don't want to just run a container. I'd rather deploy many incrementally improved versions of a container, monitor its traffic and user activity, and close the feedback loop that links usage to the next development decision. <br /></p><p>In <a href="https://wiki.c2.com/?QuotesOnProgramDevelopment">other words</a>: I can always build simple code if it doesn't have to run in production.<br /></p>

A year of mob programming, part 5: methodology (2022-01-20)

<p>I wouldn't call mob programming a methodology, but rather a practice that can be adopted. 
Like methodologies, however, it only surfaces problems and questions faster <a href="https://cacm.acm.org/magazines/2000/10/7556-the-five-orders-of-ignorance/fulltext">rather than providing solutions</a>.<br /><br />For example, a team might have a low bus factor due to the specialisms of its members, and only one person might be used to editing code in a certain language or application. Other team members might have gaps, such as not being set up with the tools and credentials to work on infrastructure and operations; or lacking the knowledge to perform screen reader testing for accessibility.</p><p>Besides exposing bus factors and gaps, mob programming brings design conflicts and opinions into the open. It takes little effort to turn the other way when looking at a pull request, or to switch to busy work or generic "<a href="https://www.waterfall2006.com/Refuctoring.pdf">refuctoring</a>" when alone; disagreements and questions about value can often crop up under your nose inside a mob session.</p><p>However, that doesn't mean that the practice automatically gives the team the psychological safety to talk about those issues out loud, and to resolve them. 
That's like asking a hammer to walk to the shop and buy the right nails.</p><p>For this last part of the post, I thought I'd compare my mob programming experience with some of the theory; the theory being famous Agile software development methodologies and their values or principles.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgtgvsRplVn7qY8PRBq6T5UrZNZDKivGo3lbccglCufS4VwlG5Kw5u7S9gKC-LDtUpWb8JFuYscgsxP0eLdWnpUoV-7iObMwlU1eOVCHBLXlBjHmcy8rAdZBT7ZFoMHgONg4Kth6pc6RpNGKyxP2hXZHlR5DOdUHehNYRLNX49F9my9xUafCD4=s3888" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="2592" data-original-width="3888" height="213" src="https://blogger.googleusercontent.com/img/a/AVvXsEgtgvsRplVn7qY8PRBq6T5UrZNZDKivGo3lbccglCufS4VwlG5Kw5u7S9gKC-LDtUpWb8JFuYscgsxP0eLdWnpUoV-7iObMwlU1eOVCHBLXlBjHmcy8rAdZBT7ZFoMHgONg4Kth6pc6RpNGKyxP2hXZHlR5DOdUHehNYRLNX49F9my9xUafCD4=s320" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The serene predictability of waterfall software development<br /></td></tr></tbody></table><h4 style="text-align: left;">Within <a href="http://www.extremeprogramming.org/values.html">Extreme Programming values</a>:</h4><ul style="text-align: left;"><li><b>Communication</b> inside a mob is very high: synchronous conversations supported by a visual medium such as code or a digital whiteboard.</li><li><b>Simplicity</b> is fostered by people in a diverse team reminding others that You Aren't Gonna Need It, but also by the need to understand what's going on at all times.</li><li><b>Feedback</b> from the rest of the team, or even from a customer proxy, is immediate.</li><li><b>Courage</b> is needed to be a driver in an unfamiliar setting, or even a navigator asking the mob for help on a task we are uncomfortable 
with.</li><li><b>Respect</b> is necessary to work face to face for prolonged periods of time.<br /></li></ul><h4 style="text-align: left;">Within <a href="https://en.wikipedia.org/wiki/Lean_software_development">Lean</a> values, mob programming:</h4><ul style="text-align: left;"><li><b>eliminates waste</b> such as <b>partially done work</b>: there should ideally be no open pull request or branch, as code is produced and integrated in the mob continuously;</li><li>eliminates waste such as <b>outdated features</b> or gold plating: when you include a product owner in the mob, it's easy to steer and change the scope upon feedback rather than plowing on with the original idea;</li><li>eliminates waste such as <b>handoffs</b>: there is no packaging up of code for review, or of pixel-perfect Photoshop designs for front end developers, or of front end requests for back ends to implement, and so on.</li><li><b>reduces task switching</b>: since everyone who is needed for progress is present, you often don't need to switch task because of waiting on someone, but can just carry on.</li><li><b>amplifies learning</b>: ideas are tried out in code from the get-go, and most of the team can learn from the results, including the product owner.</li><li><b>optimizes the whole</b>: most of all, we are not optimizing how much of everyone's time we can use, but how fast a single piece can flow from idea to customer value.</li></ul><p>Image by <a href="https://commons.wikimedia.org/wiki/File:Waterfall_on_Johilla_river,_Umaria_dist,_Madhya_Pradesh,_India.jpg">Yann</a>. 
<br /></p>

A year of mob programming, part 4: the remoteness of it (2022-01-08)

<p>I don't have a frame of reference for mob programming in a co-located environment, besides conference (or team exercise) workshops that didn't involve production code.</p><p>On a video call, the <i>driver</i> is still always explicitly defined as the person sharing their screen and tools. This is similar to the person physically having the keyboard when you are all sitting around a table.</p><p>The <i>navigator</i> can either be more implicit, move around, or be strictly defined by a process. If an implicit navigator doesn't emerge for the current commit, it should be considered normal to quickly nominate one as needed. Over time, we normalized the phrase <i>"can you navigate [that change]?"</i> in response to a proposal.<br /></p><p>Due to latency, in a video call environment there is a sort of <a href="https://en.wikipedia.org/wiki/Relativity_of_simultaneity">Lorentz time</a> where you may hear overlapping voices when someone else doesn't, and vice versa. 
It all depends on how much latency each video link is experiencing, and the adjustments that software makes can be jarring, such as video freezes or audio being delivered 10-20 seconds later.</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEge4W3NYVz2IunLKyZGh6i1Wan7SOUt79_9Cm-XLahRkfBoU-PYoHiDs3kjvw2NRCTaoDX2itfBEICtTQWEph9kQbkgN_proFo8z0CQOBhFy1Rv5OrRewxo6f_v9i4c2Aug7UFaix96ZNTW0SctSMOm5Y-ryQnIigGIx8Gfrynbc2AowRU1XLc=s744" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="438" data-original-width="744" height="188" src="https://blogger.googleusercontent.com/img/a/AVvXsEge4W3NYVz2IunLKyZGh6i1Wan7SOUt79_9Cm-XLahRkfBoU-PYoHiDs3kjvw2NRCTaoDX2itfBEICtTQWEph9kQbkgN_proFo8z0CQOBhFy1Rv5OrRewxo6f_v9i4c2Aug7UFaix96ZNTW0SctSMOm5Y-ryQnIigGIx8Gfrynbc2AowRU1XLc=s320" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">"Sorry, two people talking over each other" "Really??"<br /></td></tr></tbody></table><p></p><p>I'm sure though everyone has already experienced people talking over
each other in a physical room: it isn't exclusively a remote-working
problem, and depends more on team interactions than on technology.</p><p>We also want to communicate and especially give feedback through voice, as people's faces are not always visible to a driver looking at code; nodding doesn't necessarily help.<br /></p><p>Perhaps a bigger problem is the fact that video calls <b>don't allow side conversations to happen</b>, as there is a single audio channel that our ears cannot separate into directions. This is likely to impact a massively parallel <a href="https://blog.avanscoperta.it/2020/03/26/eventstorming-in-covid-19-times/">EventStorming</a> around a whiteboard, less so an ensemble of 3-4 people trying to get to a consensus on their next move. Side conversations however include those famous water cooler talks: bumping into people in the kitchen or while on a break.<br /></p><p>Your theory of mind might have more trouble cutting through the limited signal, and understanding what mental state other people in the team are in; and whether today's problem is more due to Internet connection trouble, or a real conflict between team members. The daily retrospective helps people put tools down and reflect on the day, possibly proposing experiments for the next session. </p><p>Looking at the world around me, I'm quite sure I would be as depressed by remote work as many, if it consisted of me being alone for most of the day. So, thanks to mob programming, I have a different perspective on how incredibly engaging working from home can be.</p><p>Stay tuned for the next (and last) part of this post, <a href="https://www.giorgiosironi.com/2022/01/a-year-of-mob-programming-part-5.html"><i>Methodology</i></a>.</p><p>Image by <a href="https://commons.wikimedia.org/wiki/File:Lamport-Clock-en.svg">Duesentrieb</a>. 
<br /></p>

A year of mob programming, part 3: a laboratory for team dynamics (2021-11-17)

<p>Our implementation of mob programming consists of a permanent Google Meet video call, with cameras on, in which we rotate a developer sharing a screen containing IDE and browser. So there are two unusual practices to get used to: being in a group all the time, and being on camera all the time.<br /></p><p>I was originally surprised at how both practices went from exhausting to just part of a normal work day. This is my experience, and we have to consider neurodiversity and personal preferences in the team about long screen time and continuous social interaction. It's almost obvious to say, but what people sometimes hate is inconclusive meetings in which they have no say, not being together with other human beings.<br /></p><p>Don't underestimate your capacity to adapt, but recognize that your energy is not always going to be the same; whether because you got a cold, or something is going on at home, or some task is particularly draining. 
</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2RO5Hyj6NgxPx4oleYdw1ORJu77SnkpZTGKXZJHisDfjwa_hHF6KKSRIm12kpv7PauY6qRnX-XFaD80zGfKxUQ60pkEvei_DxzfVWnwigygS79Ri8qQiOkp9tyHWbkwgYntVgIA/s2048/Pieter_Bruegel_d._%25C3%2584._035.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="2023" data-original-width="2048" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2RO5Hyj6NgxPx4oleYdw1ORJu77SnkpZTGKXZJHisDfjwa_hHF6KKSRIm12kpv7PauY6qRnX-XFaD80zGfKxUQ60pkEvei_DxzfVWnwigygS79Ri8qQiOkp9tyHWbkwgYntVgIA/s320/Pieter_Bruegel_d._%25C3%2584._035.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The Flemish inscription at bottom reads: <i>because feedback is perfidious, I will go code in my cave</i><br /></td></tr></tbody></table><p><a href="https://www.researchgate.net/publication/2394870_Innovation_and_Sustainability_with_Gold_Cards">Gold cards</a> and <a href="http://www.extremeprogramming.org/rules/spike.html">spike branches</a> give people the ability to work on their own when they request to. Mechanical work such as upgrading dependencies or investigating logs can benefit from focused time from one person. If you were in an office, you could go to an isolated room; working remotely, it's even easier, as it just consists of leaving the video call and coming back refreshed later.<br /></p><p>The opposite can also happen, when someone like me manifests Fear Of Missing Out while the mob makes lots of decisions in their absence. As the team gets to a <a href="https://blog.trello.com/form-storm-norm-perform-stages-of-team-productivity">norming</a> phase, we should expect fewer surprises for members coming back to the mob after some time off. 
<br /></p><p>Being together with the rest of the team can raise your motivation, thanks to the group support in getting started or unstuck from a particular problem; but of course the whole group can struggle on occasion. <br /></p><p>Every time there is a new composition, though, there are changes in how the mob is working and what it pays more attention to. And from the point of view of a growing team leader, being embedded in a mob environment means being positively inundated with information. The challenge is to make sense of what we see, to understand where team members are struggling and what they are finding most effective. </p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAAwVJB3_0PupBxMUjAwZATc_I8iTLGpe7zUA_qXEZdNNCITTn0N9gj8IGDTkVU9S5QXSxup0E5aIiN4T2PUtW4GCSpbN0v5SkDkPc9IxfpkMua2a6cfIb0RpCVDBVjyk_P_gfIQ/s2048/FIFA_WC-qualification_2014_-_Austria_vs_Ireland_2013-09-10_-_Giovanni_Trapattoni_04.JPG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1360" data-original-width="2048" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAAwVJB3_0PupBxMUjAwZATc_I8iTLGpe7zUA_qXEZdNNCITTn0N9gj8IGDTkVU9S5QXSxup0E5aIiN4T2PUtW4GCSpbN0v5SkDkPc9IxfpkMua2a6cfIb0RpCVDBVjyk_P_gfIQ/s320/FIFA_WC-qualification_2014_-_Austria_vs_Ireland_2013-09-10_-_Giovanni_Trapattoni_04.JPG" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>"I wish I could see players only in 1:1 meetings"</i> -- No football coach ever<br /></td></tr></tbody></table><p>This requires a lot of (cognitive) bandwidth, and I usually can't focus on technical architecture and on social roles at the same time. 
But maybe this is just an argument for working in a team where the different hats can be rotated; just as there is always someone writing code, there can always be someone checking whether we have started to talk over each other, or are falling into a rabbit hole of an hour without committing.<br /></p><p>Stay tuned for the next part of this post, <a href="https://www.giorgiosironi.com/2022/01/a-year-of-mob-programming-part-4.html">The remoteness of it</a>.</p><p><i>Images by </i><i><a href="https://commons.wikimedia.org/wiki/File:FIFA_WC-qualification_2014_-_Austria_vs_Ireland_2013-09-10_-_Giovanni_Trapattoni_04.JPG">Michael Kranewitter</a> and in the <a href="https://commons.wikimedia.org/wiki/File:Pieter_Bruegel_d._%C3%84._035.jpg">public domain</a>.</i><br /></p>

A year of mob programming, part 2: Collective Code Ownership (2021-11-09)

<p>Compared with a team that assigns tasks to developers by their function (frontend, backend, infrastructure, and so on), mob programming fosters <a href="http://www.extremeprogramming.org/rules/collective.html">Collective Code Ownership</a>.</p><p>This is more in the sense that everyone should be able to contribute through the mob in any area, rather than everyone being able to prioritize on their own what needs to be improved. The pact is that our time belongs to the team, and the team decides what is important.<br /></p><p>A result of this is a <b>healthy prevention of knowledge silos</b>, in which you can exercise your own creativity just because no one else knows how to work in that area anymore.<br /></p><p>Code review happens continuously in the mob on small changes, and there are no huge feature branches to go through. 
With large changes like those, it's too late for review to suggest a valuable but completely different approach; there is too much investment of time and energy in the code. If review dares to suggest groundbreaking changes, it <a href="https://chelseatroy.com/2019/12/18/reviewing-pull-requests/">causes extensive rework</a> instead.<br /></p><p>There is a larger picture on ownership: team members tend to keep track of what they care about, whether that is consistency in architecture, front end approaches to styling, performance issues, and so on. <br /></p><p>Sometimes I found myself being talked out of a design I had in mind, counteracting the bias I might have for the first or most familiar solution. I started working on this team with an object-oriented mindset, but we have transitioned to functional programming in TypeScript. <br /></p><p>Sometimes, there are hills to die on: hard-to-reverse decisions on architecture, or programming language choices. It's the job of a psychologically safe team to discuss these choices collectively, without drama or oversimplifications like JavaScript being the death of computer science. 
</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj26_c66aohnZp2kMSvmOpdD7HO6DJhkXbRw5F48zuuAAeeGrDt0z_08MbncDknQCGRonB-73HmcED8ZTss1TVhpp2EtOikjdcMrBV9_DVLdCWUma6m-EUVB5lHiBCTLt69gSpoCQ/s1024/Maria_Saal_Zollfeld_Virunum_Arena_Podium_r%25C3%25B6mische_Bauinschrift_04102017_1336.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="696" data-original-width="1024" height="218" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj26_c66aohnZp2kMSvmOpdD7HO6DJhkXbRw5F48zuuAAeeGrDt0z_08MbncDknQCGRonB-73HmcED8ZTss1TVhpp2EtOikjdcMrBV9_DVLdCWUma6m-EUVB5lHiBCTLt69gSpoCQ/s320/Maria_Saal_Zollfeld_Virunum_Arena_Podium_r%25C3%25B6mische_Bauinschrift_04102017_1336.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>"I told you we should have written the code in English, not Latin!"</i><br /></td></tr></tbody></table><p style="text-align: left;">It took three different iterations to get a repeatable pattern for accepting Commands (as in CQRS). But it was more fruitful to focus on the Domain Events design, as a choice that is difficult to revisit, than on the shape of the code itself, which can be refined at any time.<br /></p><h3 style="text-align: left;">Working outside of the mob</h3><p style="text-align: left;">The counterpart to the XP mantra of writing all production code in pairs (or a mob) is to allow team members to contribute when they are working alone due to other necessities, such as an unstable Internet connection or a flexible day where they can't align completely with the core hours of the team. <br /></p><p style="text-align: left;">I have learned to ask the team to commission me something to do, or at least to give them a choice. 
This keeps the prioritization in the hands of the group, again reducing bias.<br /></p><p style="text-align: left;">It's helpful to assign problems that only have a constrained set of possible solutions, such as renaming, or propagating a refactoring through the codebase for consistency's sake. <br /></p><p style="text-align: left;">It also helps to report back what you have learned during a solo coding exercise when rejoining the mob; unexpected issues or decisions you had to take on the spot and feel unsure about. <br /></p><p style="text-align: left;">You're allowed to stop at a certain point, as <b>making progress alone at all costs is valued less than maintaining collective code ownership</b> and a shared understanding of how our application works. <br /></p><p style="text-align: left;">Technical <a href="http://www.extremeprogramming.org/rules/spike.html">spikes</a> can also be chosen by a single person to work on, either because they want to individually prioritize them to demonstrate an idea; or because the amount of investigation required makes it difficult or frustrating to collaborate.<br />Spikes can be built on throwaway branches, optimized for learning rather than for delivering production-quality code. Once an idea has been demonstrated and approved, the mob can implement it pretty quickly on the trunk and, if we believe in the practice, with a higher level of quality. I constantly find feature branches worked on by a single person to be a dead end (<span style="font-size: x-small;">especially if they are my own</span>). <br /></p><p style="text-align: left;">In the end, people are frequently in meetings, researching, or even just on holiday on a given day. Hence there's always someone missing who will catch up on the progress when they come back into the mob. 
But the group itself never stops even as its composition changes, so some of that progress will happen every day, other things being equal.</p><p style="text-align: left;">Stay tuned for the next part of this post, <a href="https://www.giorgiosironi.com/2021/11/a-year-of-mob-programming-part-3.html">A laboratory for team dynamics</a>.<br /></p><p><i>Image by <a href="https://commons.wikimedia.org/wiki/File:Maria_Saal_Zollfeld_Virunum_Arena_Podium_r%C3%B6mische_Bauinschrift_04102017_1336.jpg">Johann Jaritz</a>.</i><br /></p>

A year of mob programming, part 1: metaphors (2021-10-20)

<p>I've been practicing remote mob programming in <a href="https://sciety.org/">my team</a> for more than a year, writing more than 90% of production code inside a video call with multiple developers looking at the same screen.<br /></p><h2 style="text-align: left;">A bit of context</h2><p>I work in the scientific publishing domain, on web applications oriented to help scientists throughout their day.</p><p>My current team was formed from the start as a remote team, and has been working with mob programming for more than a year. It is a cross-functional team that includes a product manager and often a designer in the (virtual) room for co-creation.</p><p>Our set of technologies includes TypeScript, lots of semantic HTML, and minimal client-side JavaScript. 
Container-based infrastructure and databases are pulled in as required by the evolution of the product.</p><p>This is an experience report, not advice that can be applied blindly to your team or organization.<br /></p><h2 style="text-align: left;">Metaphorical definitions</h2><p style="margin-left: 40px; text-align: left;"><i>All the brilliant people working on the same thing, at the same time, in the same space, and on the same computer </i>-- <a href="https://mobprogramming.org/">Woody Zuill</a> <br /></p><p>Like pair programming, mob programming can be simplistically explained through the metaphor of driving a car in a rally. Or at least, that's how I always understood the driver and navigator pair.<br /></p><ul style="text-align: left;"><li>A <b>driver</b> shares their screen and has sole access to the keyboard. They are an intelligent IDE capable of executing mechanical refactorings and fixes, but not of deciding a design direction for the code.</li><li>A <b>navigator</b> verbalizes what the driver should be doing, making decisions that reach down to what test or line of code to write.</li><li>The rest of the group is in the back seat of the car, and can speak at will, volunteering information to the navigator or answering questions. Are we there yet?<br /></li></ul><p style="text-align: left;">The roles of driver and navigator are rotated often, in fixed time or scope increments such as on every commit. The navigator can implicitly move continuously from one person to another, as long as there isn't more than one navigator at once.<br /></p><h2 style="text-align: left;"><i>Mob</i> as a name</h2><p style="text-align: left;">Speaking Italian as a first language, the term <i>mob</i> doesn't necessarily evoke an emotional reaction from me. But I can recognize that a term used to describe organized crime might not make team members feel comfortable. 
My understanding is that <i>mob</i> should be read as <i>crowd</i> rather than <i>mafia</i>.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjazmV7Gca81xs9ii4uIfoDrqMzupCk2Xl8WoJ9M4QP2PXB3yVPul7WoC7e6NkpNKvOGjzFSQJX-iqpnNxvRncyT7X1iOsFZiIaUM_DWiEMiIltb7dI8qeD0N1ZDGikqE_z0j82TQ/s2581/Arianna_String_Quartet_%252814193354501%2529.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1218" data-original-width="2581" height="151" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjazmV7Gca81xs9ii4uIfoDrqMzupCk2Xl8WoJ9M4QP2PXB3yVPul7WoC7e6NkpNKvOGjzFSQJX-iqpnNxvRncyT7X1iOsFZiIaUM_DWiEMiIltb7dI8qeD0N1ZDGikqE_z0j82TQ/s320/Arianna_String_Quartet_%252814193354501%2529.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>"This is so wasteful. Why aren't they all in separate rooms playing 4 different songs?"</i><br /></td></tr></tbody></table><p style="text-align: left;">Nevertheless, we looked for more metaphors and terms that can be used to talk about the group of people who work together:<br /></p><ul style="text-align: left;"><li><b>orchestra</b> emphasizes the coordination required by a group of people, and their simultaneous presence. 
I'm not sure they need a conductor though.</li><li><b>ensemble</b> is a similar musical metaphor that focuses on a small group of complementary roles closely working together.</li><li><b>swarm</b> refers to a group of bees all converging towards a single task to get it done.</li><li>the <b>team</b> itself ideally coincides with the mob, though there can be non-technical roles that find it difficult to contribute all the time; the mob can also temporarily shrink below the full set of developers in the team, or multiple mobs can appear in a relatively large team.</li><li>a <b>womble</b> is a <a href="https://en.wikipedia.org/wiki/The_Wombles">fictional furry creature from children's books</a> that helps the environment by cleaning up rubbish. It has no relation to groups whatsoever, and carries no connotations, fortunate or unfortunate. We ended up using this word for a while for disambiguation.</li></ul><p>These are all the metaphors that I've seen used to describe mobbing. Stay tuned for the next part of this post, <a href="https://www.giorgiosironi.com/2021/11/a-year-of-mob-programming-part-2.html">Collective Code Ownership</a>.</p><p><i>Image by <a href="https://commons.wikimedia.org/wiki/File:Arianna_String_Quartet_(14193354501).jpg">Princess Ruto</a>.</i><br /></p>Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-36543463508579623022019-02-03T18:26:00.003+01:002019-02-03T18:42:51.329+01:00How is software like cooking?Time for a light-hearted post. After my move to the UK and having had my share of fish and chips, I have become by reaction more interested in Italian culinary history and practice. 
So I started diving into the science and the tradition of cooking, reading books such as <a href="https://www.amazon.it/scienza-carne-chimica-bistecca-dellarrosto/dp/8858016025">the science of meat</a> which combine chemistry and good taste, and I have now cooked enough lasagne to build a statistically significant sample.<br />
<br />
<i>Disclaimer: this post is full of meat references as that's culturally significant as a metaphor to transmit the concepts I have in mind. You may find this distasteful if you have chosen to follow a different path.</i><br />
<br />
So here are 5 ways in which software development and cooking are alike...<br />
<h3>
Feedback loops</h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9Hk0OcqVyUh5jnBUXgsgxaNhc8fq3YASSdj8I0p6Hn9VGI71rp_PTJDiMUZ2ohzKYsPEERfobjjCqDf96f9-vjkggih2FVHX2Kjzo2rcg5IPLUXqdc9o7KNnHVAkQBjJcLMBIWw/s1600/bender.jpeg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="375" data-original-width="500" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9Hk0OcqVyUh5jnBUXgsgxaNhc8fq3YASSdj8I0p6Hn9VGI71rp_PTJDiMUZ2ohzKYsPEERfobjjCqDf96f9-vjkggih2FVHX2Kjzo2rcg5IPLUXqdc9o7KNnHVAkQBjJcLMBIWw/s200/bender.jpeg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">"to serve man"</td></tr>
</tbody></table>
There's a joke in a Futurama episode about Bender (being a robot) not having a sense of taste and hence playfully disgusting the humans in the team with too much salt. The joke works because in cooking you need a continuous feedback loop to conform to your taste, for example adding salt and pepper at the end of a preparation until it tastes right.<br />
<br />
We are no strangers to this process in software development: most of the <a href="https://www.giorgiosironi.com/search/label/continuous%20delivery">practices I preach about</a> lead to getting working software in front of someone that will use it as soon as possible, to better steer future development with the feedback.<br />
<br />
There are shorter, inner feedback loops than tasting: the speed with which meat is browning will make you adjust your source of heat for that phase of the preparation to avoid charring the exterior surface to a pitch black color. Not too different from your unit tests failing and informing you of an issue well before it gets to an actual customer who would send the steak back to the kitchen.<br />
<h3>
Quality is in the eye of the beholder </h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUNrPCVkkU1_t1E4vv6zv84uEFsfNoyKgy8rZk9tDxjOYgWXUMm4EMfCz4zKnVFTpNTJUuR8r0nqzeTrlfKC8wSLVZK0asUZypHOYvnPtv7XI5YwK4zEvpEoSw-IMPOTZl2uQ90g/s1600/cacio_e_pepe.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUNrPCVkkU1_t1E4vv6zv84uEFsfNoyKgy8rZk9tDxjOYgWXUMm4EMfCz4zKnVFTpNTJUuR8r0nqzeTrlfKC8wSLVZK0asUZypHOYvnPtv7XI5YwK4zEvpEoSw-IMPOTZl2uQ90g/s200/cacio_e_pepe.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cacio e pepe: simple &amp; tasty, or poor?</td></tr>
</tbody></table>
Taste has lots of different components, including not just what your tongue perceives but also smells, presentation, and expectations. But for most of these aspects, quality is in the eye of the beholder and we can't avoid coming to grips with the variety of people and cultures.<br />
<br />
Therefore, despite how good you think your burgers are, some people just don't like fatty minced meat. I appreciate Indian curries but I have some physical limits on the spiciness levels that make me almost always choose mild <a href="https://en.wikipedia.org/wiki/Korma">korma</a>. And imagine the cultural shock of discovering I have been wrongly putting lemon in my tea all my life <a href="https://www.theguardian.com/science/brain-flapping/2014/oct/03/how-to-make-tea-science-milk-first">instead of milk as the only acceptable choice</a>.<br />
(I know, tea doesn't even grow in Europe, but I grew up with British tea as the standard.)<br />
<br />
We all have good intentions in thinking hard about what a user will enjoy or be productive with; but we have to recognize there is <a href="https://uxmastery.com/create-ux-personas/">a vast variety of users</a> and we have to design (or cook) for each of them.<br />
<h3>
Control the process, rather than micromanaging the material</h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpXboVXlKX_-VoFiwojl8r8bKMYvRoZjZi0ygwX3hkJhwixSR8YwyeTqljnrqDYMOki7ezhhxJNLiJC9ZRa59T-LLj5TqjXMcH1XM6ZwzI3VzcaKc9F2RkH9O4FF_SwIpuWwZJ1A/s1600/angel_fluo.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpXboVXlKX_-VoFiwojl8r8bKMYvRoZjZi0ygwX3hkJhwixSR8YwyeTqljnrqDYMOki7ezhhxJNLiJC9ZRa59T-LLj5TqjXMcH1XM6ZwzI3VzcaKc9F2RkH9O4FF_SwIpuWwZJ1A/s200/angel_fluo.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This cake was fluorescent on purpose</td></tr>
</tbody></table>
Convection ovens are a great example of a controlled process for cooking uniformly. In the context of large pieces of meat or fish, this mainly means getting them to a uniform, high-but-not-too-high temperature to avoid overcooking. The oven fan pushes hot air all around them, heating the surface evenly. Air transfers less heat than, for example, water; so there is time for the temperature to rise across your roasting chicken rather than overcooking the outside while leaving some parts dangerously raw.<br />
<br />
Thus for large cuts this is pretty much impossible to achieve in a pan, unless you literally cut everything into slices thin enough that they can cook quickly. The oven-based process is much more convenient as you literally abandon your tray in there, checking from time to time if it's ready with a thermometer.<br />
<br />
Generally speaking, for enterprise applications and websites I favor a process in which we catch bugs with multiple safety nets (up to user experimentation if possible) rather than overdesigning for every possible problem. While you can think up scenarios to test endlessly, bugs are always going to happen; it's more important to have a process in place by which they are fixed and, thanks to new automated tests, never regress. That makes your software converge to a steady, stable state like a perfectly cooked chicken.<br />
<h3>
You need to measure</h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzMJyCMUN9WWf6r8fLTvlVwqBwczePefy4E75O-F8XCtL5kcPa1e50FX0LS1xSc7zm7jUIB_Ia6f7TukruAfPm8ypsVDuqnOnlGzHVZo7ihgHIVLBWXdfuEWqwBjVMW6HIjRXPUA/s1600/2019-02-03+17.16.20.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzMJyCMUN9WWf6r8fLTvlVwqBwczePefy4E75O-F8XCtL5kcPa1e50FX0LS1xSc7zm7jUIB_Ia6f7TukruAfPm8ypsVDuqnOnlGzHVZo7ihgHIVLBWXdfuEWqwBjVMW6HIjRXPUA/s200/2019-02-03+17.16.20.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">That's definitely cold, not just for meat</td></tr>
</tbody></table>
If you want to consistently cook meat to your liking, there is no escape from using a thermometer to understand when it's ready - its center reaching a set temperature that corresponds to medium-rare (for a steak) or well-cooked (for poultry) or something else. Looking at the external color? No relation with the inside. Checking how hard it has become? Too subjective. Roast for a certain amount of time? Ignores the variability of both the ingredients and the heat source.<br />
<br />
When we perceive part of an application as slow, we need to use a profiler to find out what functions or methods are taking the most time to execute. Making as few assumptions as possible, we collect data to point us in the right direction. Opinions don't count: your browser timings and other metrics do (if collected correctly).<br />
<h3>
You can substitute ingredients, to a certain extent</h3>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD0rxvsl8H8qIlIC0m4ZZoFUkOQ7HmpPLJzh0gr5iSARoCLdo2vLCX6_APFGaAVbZUyyH03ZgwoH9LGnh4hAUtxawDugiej0m_l2dQp890SN-QajXSpGq3w2r0GzfAH_PNX1RR0g/s1600/focaccia.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD0rxvsl8H8qIlIC0m4ZZoFUkOQ7HmpPLJzh0gr5iSARoCLdo2vLCX6_APFGaAVbZUyyH03ZgwoH9LGnh4hAUtxawDugiej0m_l2dQp890SN-QajXSpGq3w2r0GzfAH_PNX1RR0g/s200/focaccia.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Focaccia genovese</td></tr>
</tbody></table>
Cornstarch and flour are used in small quantities in many recipes, with the goal of thickening a liquid. This is due to their starch content, as this carbohydrate's granules swell up with water, creating enough friction to transform a liquid with the viscosity of water into something that feels like cream.<br />
<br />
If you try to use cornstarch to make bread, however, you won't be able to get an elastic product, as it lacks the proteins that would build gluten. Even if you use the wrong flour (cake flour as opposed to bread flour, to keep it simple), this will greatly affect the result due to the smaller <i>percentage</i> of proteins that it contains. Baking, both for sweet and savoury goods, requires much more precision.<br />
<br />
In software development, we have grown up with Lego bricks as a metaphor and we continuously try to swap out pieces, hiding details behind a useful abstraction that sometimes leaks. Nowadays relational databases can be queried interchangeably if you stick to standard SQL queries. But the data types for columns can be pretty different in the range they support, especially if they are somewhat more exotic like JSON and XML fields rather than integers and strings. A wise decision is still required to understand when substituting components is possible, or where <a href="https://scholar.harvard.edu/waldo/publications/note-distributed-computing">some combinations will never work</a>.<br />
<br />
And here are 5 ways in which software development and cooking are very different...<br />
<h3>
Cooking is a repeatable process</h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiinGEbJ4UfatskLztE1VK_Ft8KvJB0b3i26nUf1m4cPJd2NZP5m0aMmIpRnuIVT0l18XpZ8uUcKESsJpM4tkm6dK99OK_BUz-G4pSjcs02svyGLJ7jIGMOiijqru-xvhyASr7Mw/s1600/dutch_cheese.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiinGEbJ4UfatskLztE1VK_Ft8KvJB0b3i26nUf1m4cPJd2NZP5m0aMmIpRnuIVT0l18XpZ8uUcKESsJpM4tkm6dK99OK_BUz-G4pSjcs02svyGLJ7jIGMOiijqru-xvhyASr7Mw/s200/dutch_cheese.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lots of Dutch cheese around Amsterdam</td></tr>
</tbody></table>
Recipes (at least the good ones) are literally the codification of a process that should be robust to external variations to get a consistent result. It's the mark of a good cook to be able to deal with variations in ingredients or tools, but unless you are up on a mountain, water boils at pretty much the same temperature, and the physical transformation that your carrots undertake when they are heated is well established.<br />
<br />
In software, every new feature is a new design to make rather than the execution of a plan. Even porting software or reimplementing it bears surprises, as the platform it is running on is now different. And no one quite knows how long it took to produce the original version, let alone how to estimate the new one being created. We have processes for understanding what a feature should do, and safely implementing it and rolling it out; but there are always land mines waiting on the path.<br />
<h3>
Cooking has some <i>precise</i> physical quantities you can rely on </h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCoIMX0x8LP3xHFJM-y3cur_wcjic0yKrPhItnrw7IHZ6fT7-gsrwavYLG9VmDjUuhZGLbWK8Gce6TqVG-Cnh9GvKFa44BzNjuu_5uLlouCMN6VHHlYEEIumMOr1elffxbYZ-zHQ/s1600/pork_fillet.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="900" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCoIMX0x8LP3xHFJM-y3cur_wcjic0yKrPhItnrw7IHZ6fT7-gsrwavYLG9VmDjUuhZGLbWK8Gce6TqVG-Cnh9GvKFa44BzNjuu_5uLlouCMN6VHHlYEEIumMOr1elffxbYZ-zHQ/s200/pork_fillet.jpg" width="112" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Peppered pork fillet steak</td></tr>
</tbody></table>
Understood: measurement is needed in both fields. But as much as your oven oscillates around its target temperature, it is still much more precise than a developer's effort. Even without meetings and other time variables, how fast and precise we are varies from day to day: humans aren't robots. How knowledgeable we are about a technology greatly influences the design and testing phases. The <a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month">Mythical Man-Month</a> remains, well, mythical.<br />
<br />
In the food industry, the right tools can even measure the <a href="https://en.wikipedia.org/wiki/Wheat_flour#Flour_strength_%E2%80%93_W_index">strength of a flour</a>, to check whether it's good for the bread you want to obtain. If you look at a technology team, measuring how many tasks per week we have completed is probably as good as it gets. There are humans involved, and applying social science to a very small group probably doesn't get you very far in terms of collecting data and drawing inferences.<br />
<br />
You can still measure other times objectively, like time to deploy: how long it takes for a commit on <i>master</i> to reach the production environment. We partly do this because it's important but also because it's feasible to measure. What most project managers care about, instead, is the time from idea to complete implementation. But that requires estimating the length of a queue that changes all the time, and is just the first step of a creative development process with its own variations.<br />
<h3>
Determinism of the digital world</h3>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMcxIGUlvQpsYCVETRP8IsGwukGS2oR2SJJfY3EHYLr1ZjFG9ACaW1_JDDb0to8BjO88RwvpeD61ohu-sZPygcrMluQQx0xA_u5lsGDXgiCRaLcMK_2IH4JcTunkZOL1QVD3zFWw/s1600/2017-08-12+17.12.46.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMcxIGUlvQpsYCVETRP8IsGwukGS2oR2SJJfY3EHYLr1ZjFG9ACaW1_JDDb0to8BjO88RwvpeD61ohu-sZPygcrMluQQx0xA_u5lsGDXgiCRaLcMK_2IH4JcTunkZOL1QVD3zFWw/s200/2017-08-12+17.12.46.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Blackberries from the garden</td></tr>
</tbody></table>
<div style="text-align: right;">
</div>
It's pretty difficult to get the same tomatoes, courgettes or grapes as last week, and pretty much impossible to get the same ones in-season and off-season. You can ship them in from South Africa or Australia but travel time and refrigeration can modify their contents, and thus their taste.<br />
<br />
If you look at a physical server, it's much more similar to laboratory equipment than to a living product: you can run programs and see them always taking a similar amount of time to complete, controlling the randomness of the operating system around it. This gets eroded a bit in the cloud, where performance may be affected by your neighbors due to <a href="https://link.medium.com/oXoQFtVDDT">co-tenancy</a>.<br />
<h3>
Timing in the kitchen</h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSlD9NB4x4yU9G6b3imKg4aHL3OWf5uG9u0IpfozBBN_j2x89wCA4PHt8iQT6SapN_6ysuxapFwV6FGJVuEF9gz6xwLcKPWHcnOZS9QGv9CbmPhmawvLmGyOpjQ27rtf3vTLH2SA/s1600/crochembouche.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="900" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSlD9NB4x4yU9G6b3imKg4aHL3OWf5uG9u0IpfozBBN_j2x89wCA4PHt8iQT6SapN_6ysuxapFwV6FGJVuEF9gz6xwLcKPWHcnOZS9QGv9CbmPhmawvLmGyOpjQ27rtf3vTLH2SA/s200/crochembouche.jpg" width="112" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Very unstable crochembouche</td></tr>
</tbody></table>
<div style="text-align: left;">
</div>
Whether it is simply changing the temperature of a meat joint, or a more complex transformation like baking a cake, timing should be one of your concerns if you want to obtain a good result. I formalize this concept by observing that, in many cases, it's not possible to stop time.<br />
<br />
Cooks know tricks like cooking <a href="https://www.seriouseats.com/2014/10/the-food-lab-how-to-poach-eggs-for-a-party.html">eggs</a> or rice to a certain degree, then cooling them down and finishing the process later when the food has to be served; or simply reheating them if fully cooked. This works for various categories of products, but it's an ad-hoc process.<br />
<br />
Consider the power we have in a digital world: firing up a debugger literally stops execution at some point in the life of the program, allowing us to take a look at what we want in the right context. Since the state of the program is <a href="https://en.wikipedia.org/wiki/The_Matrix">the Matrix</a>, we can slow it down, speed it up, and change things causing a <a href="https://matrix.fandom.com/wiki/D%C3%A9j%C3%A0_Vu">déjà vu</a> to your objects.<br />
<br />
If you want to reproduce some computation, you have the tools available to build a Docker image containing all sorts of dependencies and store it for future usage. If you want to reproduce your perfect croissants, the only tools you have are a recipe and your own memories. Add the variation of ingredients and even temperature and humidity in your kitchen, and you can understand why scientific exploration needs a laboratory with its controlled conditions to be able to make progress.<br />
<h3>
Cooking equipment makes a difference </h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU6oS5uNTRaepi9lrWhw5-0dwm3xJpp8q9sjviN0-tJ5n8-i9PM6o7ylwzTtSjsLSWmmmIV_em5zbGLpVLGmaZcIZmbqTPNwLPKYeLatwaj4rORxq7T76oQiWD-KC_WBZk7H4fOQ/s1600/2019-02-03+17.23.56.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="900" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU6oS5uNTRaepi9lrWhw5-0dwm3xJpp8q9sjviN0-tJ5n8-i9PM6o7ylwzTtSjsLSWmmmIV_em5zbGLpVLGmaZcIZmbqTPNwLPKYeLatwaj4rORxq7T76oQiWD-KC_WBZk7H4fOQ/s200/2019-02-03+17.23.56.jpg" width="112" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Now, grate parmesan without this...</td></tr>
</tbody></table>
Besides basic tools like appropriately shaped knives, a pressure cooker lets you reach results that would take a long time with an ordinary pot of boiling water. A <a href="https://www.seriouseats.com/2016/10/food-lab-complete-guide-to-sous-vide-rack-of-lamb.html">temperature bath</a> (I don't own one of these) can help cook meat evenly, only to then finish the process with a 2-minute searing. Even a <i>scale</i> is just <i>necessary</i> for baking, as measuring ingredients like flour by volume has a 50% margin of error due to its compressibility.<br />
<br />
Consider how you can write code on your old laptop from the beach instead. You target an open source interpreter, and the end product will run on the same server that could accept strictly regulated banking software. As long as you can literally string bytes together, you can produce running software: everything else helps. The ephemeralization of software tools due to virtualization and the large availability of open source platforms make digital startups a reality, whereas opening a restaurant remains a capital-intensive operation.<br />
<br />
But there's more...<br />
<h3>
The power of metaphors </h3>
Metaphors can foster understanding of a new system, or lead us astray. They powerfully transmit a mental model, but that model has its limitations and may even be less precise than a more formal model like a math analogy. But especially in complicated fields like cryptography, terms such as <i>key</i> and <i>signature</i> have popularized concepts to generations of students who would have otherwise found them very hard to think about.<br />
<br />
I wrote this post for fun, but I stand behind most of the comparisons: that's all for now. You'll find me using a <a href="https://www.giorgiosironi.com/2019/01/practical-helm-in-5-minutes.html">Helm</a> to ship my containers...Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-62055741537659528782019-01-10T19:59:00.000+01:002019-01-10T20:33:53.327+01:00Practical Helm in 5 minutes<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://helm.sh/" style="margin-left: auto; margin-right: auto;"><img alt="https://helm.sh/" border="0" data-original-height="256" data-original-width="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTtLSr4PM5XbdGj3lbx4H92mk4yPi55hvR1DiyqS7Od6UdrHuZ0j3MCI1iOw1TgIMxuf8pT5vnpSONiZwaO0x35gqEDUKscRAdp9xcIUP_PO56bM83_UhLrj9ssO-9Yu4Iutf4HQ/s1600/helm.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Yet another ship-themed name</td></tr>
</tbody></table>
<br />
Containerization is increasingly a powerful way to deploy applications on anonymous infrastructure, such as a set of many identical virtual machines run by some cloud provider. Since container images ship a full OS, there is no need to manage packages for the servers (a PHP or Python interpreter), but there are still other environment-specific choices that need to be provided to actually run the application: configuration files and environment variables, ports, hostnames, secrets.<br />
<br />
In an environment like Kubernetes, you would create all of this declaratively, writing YAML files describing each Pod, ConfigMap, Service and so on. Kubernetes will take these declarations and apply them to its state to reach what is desired.<br />
<br />
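For concreteness, here is a minimal sketch of such a declarative resource (hypothetical names, trimmed to the essentials) for a web application container:

```yaml
# Hypothetical sketch: a minimal Deployment for a containerized web application.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: green-widgets
spec:
  replicas: 2
  selector:
    matchLabels:
      app: green-widgets
  template:
    metadata:
      labels:
        app: green-widgets
    spec:
      containers:
        - name: web
          image: example.com/green-widgets:1.0.0  # environment-specific tag
          ports:
            - containerPort: 80
```

Every environment ends up needing its own copy of files like this, differing only in a handful of values.<br />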
As soon as you move outside of a demo towards multiple environments, or towards updating one, you will start to see Kubernetes YAML resources not directly as code to be committed into a repository, but as an output of a generation process. There are many tweaks and customizations that need to be performed in each environment, from simple hostnames (<span style="font-family: "courier new" , "courier" , monospace;">staging--app.example.com</span> vs <span style="font-family: "courier new" , "courier" , monospace;">app.example.com</span>) to entire sections being present or not (persistence and replication of application instances).<br />
<br />
The problem you need to solve then is to generate Kubernetes resources from some sort of templates: you could choose any template engine for this task, and execute <span style="font-family: "courier new" , "courier" , monospace;">kubectl apply</span> on the result. To avoid reinventing the wheel, <a href="https://helm.sh/">Helm</a> and other competitors were created to provide a higher abstraction layer.<br />
<h3>
Enter Helm</h3>
Helm provides templating for Kubernetes .yaml files; as part of this process, it extracts the configuration values for Kubernetes resources into a single, hierarchical data source.<br />
<br />
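As a sketch of what this looks like (hypothetical names): a template in <span style="font-family: "courier new" , "courier" , monospace;">templates/</span> references values with Go template syntax, e.g. <span style="font-family: "courier new" , "courier" , monospace;">image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"</span>, while a single file supplies the defaults:

```yaml
# values.yaml -- hypothetical defaults; one hierarchical source of
# configuration, overridable per release (e.g. with --set or -f)
image:
  repository: example.com/green-widgets
  tag: "1.0.0"
hostname: app.example.com
service:
  port: 80
```

Rendering the templates against this data source produces the plain Kubernetes YAML you would otherwise write by hand.<br />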
Helm doesn't stop there however: it aims to be a package manager for Kubernetes, hence it won't just create resources such as a Deployment, but it will also:<br />
<ul>
<li>apply the new resources on the Kubernetes cluster</li>
<li>tag the Deployment with metadata and labels</li>
<li>list everything that is installed in terms of applications, rather than Deployments and ConfigMaps</li>
<li>find older versions of the Deployment to be replaced or removed</li>
</ul>
The set of templates, helpers, dependencies and default values Helm uses to deploy an application is called a <i>chart</i> whereas every instance of a chart created on a cluster is called a <i>release</i>. Therefore, Helm keeps track of objects in terms of releases and allows you to update a release and all its contents, or to remove it and replace it with a new one.<br />
<h3>
Folder structure</h3>
The minimal structure of a Helm chart is simply a folder on your filesystem, whose name must be the name of the chart. As an example, I'll use <i>green-widgets</i> as a name, a fictional web application for ordering green widgets online.<br />
<br />
This is what you'll see inside a chart:<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">Chart.yaml</span>: metadata about the chart such as name, description and version.</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">values.yaml</span>: configuration values that may vary across releases. At a bare minimum the image name and tag will have defaults here, along with ports to expose.</li>
<li>the <span style="font-family: "courier new" , "courier" , monospace;">templates/</span> subfolder: contains various YAML templates that will be rendered as part of the process of creating a new release. There is more in this folder like a readme for the user and some helper functions for generating common snippets.</li>
</ul>
Apart from this minimal setup, there may also be a <span style="font-family: "courier new" , "courier" , monospace;">requirements.yaml</span> file and a <span style="font-family: "courier new" , "courier" , monospace;">charts/</span> subfolder to deal with other charts to use as dependencies; for example, to install a database through an official chart rather than setting up PostgreSQL replication on your own. These can be safely ignored until you need these features though.<br />
<br />
Once you have the <span style="font-family: "courier new" , "courier" , monospace;">helm</span> binary on your system, you can generate a new chart with <span style="font-family: "courier new" , "courier" , monospace;"><i>helm create green-widgets</i></span>.<br />
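That command scaffolds roughly the following layout (a sketch; the exact files vary with the Helm version):

```
green-widgets/
├── Chart.yaml          # chart metadata: name, description, version
├── values.yaml         # default configuration values
├── charts/             # chart dependencies
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── NOTES.txt       # notes shown to the user after install
    └── _helpers.tpl    # helper functions for common snippets
```

From there you edit the generated templates and defaults to match your application.<br />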
<h3>
Cheatsheet</h3>
You can download a <span style="font-family: "courier new" , "courier" , monospace;">helm</span> binary for your platform from the <a href="https://github.com/helm/helm/releases">project's releases page on Github</a>. The <span style="font-family: "courier new" , "courier" , monospace;">helm init</span> command will use your kubectl configuration (and authentication) to install <span style="font-family: "courier new" , "courier" , monospace;">tiller</span>, the server-side part of Helm, onto a cluster's system namespace.<br />
<br />
Once this is set up, you will be able to execute <span style="font-family: "courier new" , "courier" , monospace;">helm install</span> commands against the cluster, using charts on your local filesystem. For real applications, you can install official charts that are automatically discovered from the default Helm repositories.<br />
<br />
The command I prefer to use to work on a chart however is:<br />
<span style="font-family: "courier new" , "courier" , monospace;">helm upgrade --install --set key=value green-widgets--test green-widgets/</span><br />
<br />
The mix of <i>upgrade</i> and <i>install</i> means this command is idempotent and will work for the first installation as well as for updates. Normally you would issue a new release for a change to the chart, but this approach allows you to test out a chart while it's in development, using a 0.0.1 version.<br />
There is no constraint on the release name <span style="font-family: "courier new" , "courier" , monospace;">green-widgets--test</span>, and Helm can even generate random names for you. I like to use the application name and its environment name as a team convention, but you should come up with your own design choices.<br />
<br />
A final command to keep in mind is <span style="font-family: "courier new" , "courier" , monospace;">helm delete green-widgets--test</span> which will delete the release and all the resources created by your templates. This is enough to stop using CPU, memory and IP addresses, but it's not enough to completely remove all knowledge of the release from Tiller's archive. To do so (and free the release name, allowing its re-creation) you should add the <span style="font-family: "courier new" , "courier" , monospace;">--purge</span> flag.<br />
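The commands in this section can be condensed into a short reference. This is Helm 2-era syntax (with Tiller), using the hypothetical <i>green-widgets</i> chart from above:

```shell
helm init                        # install Tiller into the cluster (uses kubectl config)
helm create green-widgets        # generate a new chart skeleton
helm upgrade --install --set key=value green-widgets--test green-widgets/
                                 # idempotent: first install or subsequent upgrade
helm ls                          # list releases known to Tiller
helm status green-widgets--test  # resources and state of one release
helm delete --purge green-widgets--test
                                 # remove resources and forget the release name
```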
<h3>
Caveats</h3>
This 5-minute introduction makes it all seem plain and simple, but it should be clear that simply downloading Helm and installing it is not a production-ready setup. I myself have only rolled out this setup to testing environments at the time of writing.<br />
<br />
I can certainly see several directions to explore that I either cut from scope in order to get these environments up and running for code review, or investigated and used but did not include in this post. For example:<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">requirements.yaml</span> allows you to include other charts as dependencies. This is very powerful for off-the-shelf open source software such as databases, caches and queues; but it requires careful choices for the configuration values passed to these dependencies, and your mileage may vary with the quality of the chart you have chosen.</li>
<li>chart repositories are a good way to host stable chart versions rather than copying them onto a local filesystem. For example, you could <a href="https://github.com/hypnoglow/helm-s3">push tarballs to S3</a> and have a plugin regenerate the index.</li>
<li>the whole Helm and Tiller setup arguably needs to be part of an Infrastructure as Code approach like the rest of the cluster. For example, I am <a href="https://github.com/elifesciences/terraform-eks">creating an EKS cluster using Terraform</a> and that would also need to include the installation and configuration of Tiller to provide a turnkey solution for new clusters.</li>
</ul>
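As a sketch of the first point above, a <span style="font-family: "courier new" , "courier" , monospace;">requirements.yaml</span> pulling in an off-the-shelf database chart might look like this (the chart version is hypothetical; the repository is the Helm 2-era stable one):

```yaml
dependencies:
  - name: postgresql
    version: "3.9.5"    # pin a chart version you have actually tested
    repository: "https://kubernetes-charts.storage.googleapis.com"
```

After editing this file, <span style="font-family: "courier new" , "courier" , monospace;">helm dependency update</span> downloads the chart tarball into <span style="font-family: "courier new" , "courier" , monospace;">charts/</span>.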
Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-10073094416950479822019-01-02T10:42:00.000+01:002019-01-10T20:32:53.695+01:00The path from custom VM to VM with containers<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://commons.wikimedia.org/wiki/File:Kanda_container.jpg" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="https://commons.wikimedia.org/wiki/File:Kanda_container.jpg" border="0" data-original-height="853" data-original-width="1280" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_azNVdRFmTr0WJ7z7jct6L3uDWwyE7V-MfrbW17D6oZHQhRJP5L2hcwZyzsu8q5EoprDHpAyvNV9NnCdxnLGGzpOA7ag1vDCru86phZnspvKSYjMKfotTMlXDdPFKiocPjQrw6A/s320/1280px-Kanda_container.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image of a single container being transported by <a href="https://commons.wikimedia.org/wiki/File:Kanda_container.jpg">OiMax</a></td></tr>
</tbody></table>
Before the transition to Docker containers started at eLife, a single service deployment pipeline would pick up the source code repository and deploy it to one or more virtual machines on AWS (EC2 instances booted from a standard AMI). As the pipeline went across the environments, it repeated the same steps over and over in testing, staging and production. This is the story of the journey from a pipeline based on source code for every stage to a pipeline deploying an immutable container image; the goals pursued here being time savings and a reduced failure rate.<br />
<br />
This end point is itself an intermediate step before containers deployed into an orchestrator: our infrastructure wasn't ready to accept a Kubernetes cluster when we started the transition, nor was Kubernetes itself trusted yet for stateful, old-school workloads such as a PHP application that writes state to the filesystem. Achieving containers-over-EC2 lets developers target Docker as the deployment platform, without yet realizing the cost savings related to bin packing those containers onto anonymous VMs.<br />
<h3>
Starting state</h3>
A typical microservice for our team would consist of a Python or PHP codebase that can be deployed onto a usually tiny EC2 instance, or onto more than one if user-facing. Additional resources that are usually not really involved in the deployment process are created out of band (with <a href="https://www.martinfowler.com/bliki/InfrastructureAsCode.html">Infrastructure as Code</a>) for this service, like a relational database (outsourced to RDS), a load balancer, DNS entries and similar cloud resources.<br />
<br />
Every environment replicates this setup, whether it is a <i>ci</i> environment for testing the service in isolation, or an <i>end2end</i> one for more large-scale testing, or even a sandbox for exploratory, manual testing. All these environments try to mimic the <i>prod</i> one, especially <i>end2end</i> which is supposed to be a perfect copy on fewer resources.<br />
<br />
A deployment pipeline has to go through environments as a new release is promoted from <i>ci</i> to <i>end2end</i> and <i>prod</i>. The amount of work that has to be repeated to deploy from source on each of the instances is sizable however:<br />
<br />
<ul>
<li>ensure the PHP/Python interpreter is correctly setup and all extensions are installed</li>
<li>check out the repository, which hopefully isn't too large</li>
<li>run scripts if some files need to be generated (from CSS to JS artifacts and anything similar)</li>
<li>install or update the build-time dependencies for these tasks, such as a headless browser to generate <a href="https://www.smashingmagazine.com/2015/08/understanding-critical-css/">critical CSS</a></li>
<li>run database migrations, if needed</li>
<li>import fixture data, if needed</li>
<li>run or update stub services to fill in dependencies, if needed (in testing environments)</li>
<li>run or update real <a href="https://docs.microsoft.com/en-us/azure/architecture/patterns/sidecar">sidecar</a> services such as a queue broker or a local database, if present</li>
</ul>
This ever-expanding sequence of operations for each stage can be optimized, but in the end the best choice is not to repeat work that only needs to be performed once per release.<br />
<br />
There is also a concern about the end result of a deploy being different across environments. This difference could be in state, such as a JS asset served to real users being different from what you tested; but also in outcome, as a process that can run perfectly in testing may run into an APT repository outage when in production, failing your deploy halfway through, only on one of the nodes. Not repeating operations leads not just to time savings but to a simpler system in which fewer operations can fail just because there are fewer of them in general.<br />
<h3>
Setting a vision</h3>
I've previously automated <b>builds that generated a set of artifacts</b> from the source code repository and <b>then deployed those artifacts across environments</b>, for example zipping all the PHP or Python code into an archive or some other sort of package. This approach works well in general, and it is what compiled languages naturally do since they can't get away with recompiling in every environment. However, artifacts do not take into account OS-level dependencies like the Python or PHP version with their configuration, along with any other setup outside of the application folder: a tree of directories for the cache, users and groups, deb packages to install.<br />
<br />
Container images promise to ship a full operating system directory tree, which will run in any environment only sharing a kernel with its host machine. Seeing <i>docker build</i> as the natural evolution of <i>tar -cf ... | bzip2</i>, I set out to port the build processes of the VMs into portable container images for each service. We would then still be deploying these images as the only service on top of an EC2 virtual machine, but each deployment stage should consist of just pulling one or more images and starting them with a docker-compose configuration. The stated goal was to reduce the time from commit to live, and the variety of failures that can happen along the way.<br />
<h3>
Image immutability and self-sufficiency</h3>
To really save on deployment time, the images being produced for a service must be the same across environments. There are some exceptions like a <i>ci</i> derivative image that adds testing tools to the base one, but all prod-like environments should get the same artifact; this is not just for reproducibility but primarily for performance.<br />
<br />
The approach we took was to also isolate services into their own containers, for example creating two separate <i>fpm</i> and <i>nginx</i> images (<i>wsgi</i> and <i>nginx</i> for Python); or to use a standard <i>nginx</i> image where possible. Other specialized testing images like our own <i>selenium</i> extended image can still be kept separate.<br />
<br />
The isolation of images doesn't just make them smaller than a monolith, but provides Docker-specific advantages like independent caching of their layers. If you have a monolith image and you modify your composer.json or package.json file, you're in for a large rebuild. Segregating responsibilities instead leads to only one or two of the application images being rebuilt: you never have to reinstall those packages just to debug with Selenium. This can also be achieved by embedding various targets (<i>FROM ... AS ...</i>) into a single Dockerfile, and having docker-compose build one of them at a time with the <i>build.target</i> option.<br />
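As a sketch of that single-Dockerfile variant (base images, paths and file names are illustrative, not our actual setup):

```dockerfile
# Two targets in one Dockerfile; docker-compose selects one via build.target
FROM php:7.2-fpm AS fpm
COPY --from=composer:1 /usr/bin/composer /usr/bin/composer
WORKDIR /app
COPY composer.json composer.lock ./
RUN composer install --no-dev    # this layer is cached until composer.json changes
COPY . .

FROM nginx:1.15 AS nginx
COPY docker/nginx.conf /etc/nginx/conf.d/default.conf
COPY public/ /usr/share/nginx/html/
```

A docker-compose service can then set <i>target: fpm</i> under its <i>build</i> key (compose file format 3.4 or later), so editing application code only rebuilds the layers after the dependency installation.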
<br />
When everything that is common across the environments is bundled within them, what remains is configuration in the form of docker-compose.yml and other files:<br />
<ul>
<li>which container images should be running and exposing which ports</li>
<li>which commands and arguments the various images should be passed when they are started</li>
<li>environment variables to pass to the various containers</li>
<li>configuration files that can be mounted as volumes</li>
</ul>
Images would typically have a default configuration file in the right place, or be able to work without one. A docker-compose configuration can then override that default with a custom configuration file, as needed.<br />
<br />
One last responsibility of portable Docker images is their definition of a basic <i>HEALTHCHECK</i>. This means an image has to ship enough basic tooling to, for example, load a <i>/ping</i> path on its own API and verify a 200 OK response is coming out. In the case of classic containers like PHP FPM or a WSGI Python container, this implies some tooling will be embedded into the image to talk to the main process through that protocol rather than through HTTP.<br />
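For an image that serves HTTP directly, the check can be a sketch like the following, assuming curl is available in the image and <i>/ping</i> is whatever path your application exposes:

```dockerfile
# fail the container if its own API stops answering
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -fsS http://localhost/ping || exit 1
```

For an FPM container, the equivalent would embed a FastCGI client (for example <i>cgi-fcgi</i>) and query the FPM status page over port 9000 instead of HTTP.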
<br />
It would be a pity to reinvent the lifecycle management of the container (started, then healthy or unhealthy after a series of probes), when we can define a simple command that both docker-compose and actual orchestrators like Kubernetes can execute to detect the readiness of new containers after a deploy. I used to ship smoke tests with the configuration files, but these have largely been replaced by polling for a health status on the container itself.<br />
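The polling itself can be as simple as a loop over docker inspect (the container name here is hypothetical):

```shell
# wait until Docker's own probes report the container healthy
until [ "$(docker inspect -f '{{.State.Health.Status}}' app_fpm_1)" = "healthy" ]; do
  sleep 2
done
```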
<h3>
Image size</h3>
Multi-stage builds are certainly the tool of choice to keep images small: perform expensive work in separate stages, and whenever possible only copy files into the final stage rather than executing commands that use the filesystem and bloat the image with their leftover files.<br />
<br />
A consolidated RUN command is also a common trick to bundle together different processes like <i>apt-get update</i> and <i>rm /var/lib/apt/lists/*</i> so that no intermediate layers are produced, and temporary files can be deleted before a snapshot is taken.<br />
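A minimal sketch of that consolidated RUN pattern (the package is an arbitrary example):

```dockerfile
# a single layer: the APT cache is removed before the snapshot is taken
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/*
```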
<br />
To find out where this optimization is needed, however, some introspection is required. You can run <i>docker inspect</i> over a locally built image to check its <i>Size</i> field and then <i>docker history</i> to see the various layers. Large layers are hopefully shared between one image and the next if you are deploying to the same server. Hence it pays to verify that, if the image is big, most of its size comes from ancestor layers that seldom change.<br />
<br />
A final warning about sizes is related to images with many small files, like node_modules/ contents. These images may exhaust the inodes of the host filesystem well before they fill up the available space. This doesn't happen when deploying source code to the host directly as files can be overwritten, but every new version of a Docker image being deployed can easily result in a full copy of folders with many small files. Docker's <a href="https://github.com/elifesciences/builder-base-formula/blob/master/elife/docker-scripts/docker-prune"><i>prune</i></a> commands often help by targeting various instances of containers, images and other leftovers, whereas <i>df -i</i> (as opposed to <i>df -h</i>) diagnoses inode exhaustion.<br />
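The introspection commands mentioned above, collected in one place (the image name is hypothetical):

```shell
docker inspect --format '{{.Size}}' green-widgets:latest   # total size in bytes
docker history green-widgets:latest                        # per-layer sizes and commands
df -i                   # inode usage per filesystem, as opposed to df -h for bytes
docker system prune -f  # remove stopped containers, dangling images and networks
```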
<h3>
Underlying nodes</h3>
Shipping most of the stack in a Docker image makes it easier to change, as it becomes part of an immutable artifact that can be completely replaced, rather than a stateful filesystem that needs backward compatibility and careful evolution. For example, you can just switch to a new APT repository rather than transitioning away from the old one by removing it; you only install new packages rather than having to remove older ones.<br />
<br />
The host VMs become leaner and lose responsibilities, becoming easier to test and less variable; you could almost say all they have to run is a Docker daemon and very generic system software like syslog, but nothing application-specific apart from container dependencies such as providing a folder for config files to live on. Whatever Infrastructure as Code recipes you have in place for building these VMs, they will become easier and faster to test, with the side-effect of also becoming easier to replace, scale out, or retire.<br />
<br />
An interesting side effect is that the first stages of most projects' pipelines lost the need for a specific CI instance to deploy to. In a staging environment, you actually need to replicate a configuration similar to production, like using a real database; but in the first phases, where the project is tested in isolation, the test suite can effectively run on a generic Jenkins node that works for all projects. I wouldn't run multiple builds at the same time on such a node as they may have conflicts on host ports (everyone likes to listen on <i>localhost:8080</i>), but as long as the project cleans up after failure with <i>docker-compose down -v</i> or similar, a new build of a wholly different project can be run with practically no interference.<br />
<h3>
Transition stages</h3>
After all this care in producing good images and cleaning up the underlying nodes, we can look at the stages in which a migration can be performed.<br />
<br />
A first rough breakdown of the complete migration of a service can be aligned on environment boundaries:<br />
<ol>
<li>use containers to run tests in CI (xUnit tools, Cucumber, static checking)</li>
<li>use containers to run locally (e.g. mounting volumes for direct feedback)</li>
<li>roll out to one or more staging environments</li>
<li>roll out to production</li>
</ol>
This is the path of least resistance, and correctly pushes risk first to less important environments (testing) and only later to staging and production; hence you are free to experiment and break things without fear, acquiring knowledge of the container stack for later on. I think it runs the risk of leaving some projects halfway, where the testing stages have been ported but production and staging still run with the host-checks-out-source-code approach.<br />
<br />
A different way to break this down is to perform the split by considering the single processes involved. For example, consider an application with a server listening on some port, a CLI interface and a long-running process such as a queue worker:<br />
<ol>
<li>start building an image and pulling it in each environment, from CI to production</li>
<li>try running CLI commands through the image rather than the host</li>
<li>run the queue worker through the image rather than on the host</li>
<li>stop old queue worker</li>
<li>run the server, using a different port</li>
<li>switch the upper layer (nginx, a load balancer, ...) to use the new container-based server</li>
<li>stop old server</li>
<li>remove source code from the host</li>
</ol>
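Step 5 above can be sketched as a docker-compose fragment in which the containerized server takes an alternate port until the switch in step 6 (the image name and ports are hypothetical):

```yaml
services:
  web:
    image: registry.example.com/green-widgets:1.2.3
    ports:
      - "8081:80"   # the host-based server keeps its original port until cutover
```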
Each of these slices can go through all the environments as before. You will be hitting production sooner, which means Docker surprises will propagate there (it's still not as stable as Apache or nginx); but issues that can only be triggered in production will happen on a smaller part of your application, rather than as a big bang of the first production deploy of these container images. <br />
<br />
If you are using any dummy project, stub or simulator, they are also good candidates for being switched to a container-based approach first. They usually won't get to production however, as they will only be in use in CI and perhaps some of the other testing environments.<br />
<br />
You can also see how this piece-wise approach lets you run both versions of a component in parallel, move between one and the other via configuration and finally remove the older approach when you are confident you don't need to roll back. At the start using a Docker image doesn't seem like a huge change, but sometimes you end up with 50 modified files in your Infrastructure as Code repository, and 3-4 unexpected problems to get them through all the environments. This is essentially <a href="https://www.martinfowler.com/bliki/BranchByAbstraction.html">Branch by Abstraction</a> applied to Infrastructure as Code: a very good idea for incremental migrations applied to an area that normally needs to move at a slower pace than application code.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-54264701580136943332018-12-28T10:28:00.003+01:002018-12-28T10:30:00.545+01:00Delivery pipelines for CDNs<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.fastly.com/network-map" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="https://www.fastly.com/network-map" border="0" data-original-height="567" data-original-width="1140" height="159" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB-pYFTNyrJ5Yq15nX9rycSG0EZ3-weMzVFSEjgp1dL-2JPOU-PexZ-HcVkSkkUoiKLVdS0ZZggFOAqUgWBm4OewwcZmUt0HVtmGY2RjvXnKdPbPlhEAM5XfDKgd3tCyFT8u9Jfw/s320/fastly-map.png" width="320" /></a></div>
In the last couple of years I have integrated Content Delivery Networks into various <a href="https://elifesciences.org/">eLife</a> applications, managing objects ranging from static files and images to dynamic HTML. These projects mainly consisted of:<br />
<ul>
<li>implementing <a href="https://www.thoughtworks.com/insights/blog/infrastructure-code-reason-smile">Infrastructure as Code</a> for these CDNs inside the Github repositories we already use for all other cloud resources (AWS and GCP)</li>
<li>effectively authorize HTTPS on the CDN side, which will be impersonating your origin servers</li>
<li>create instances of the same CDN services, first in testing and then in production environments, keeping them in parity with each other</li>
<li>expand end-to-end testing (the tip of the pyramid) to cover also the CDNs rather than just covering the applications involved</li>
<li>integrate logging in order to catch any problem happening between the user and the origin servers</li>
<li>finally phase in the new CDNs with new geotagged DNS entries</li>
</ul>
Our first implementation from 2016 was widely integrated into AWS and as such <a href="https://aws.amazon.com/cloudfront/">CloudFront</a> was the chosen solution. We subsequently switched to <a href="https://www.fastly.com/">Fastly</a> for all ordinary traffic, experiencing a general increase in features, customization and expenses. What follows is a comparison that isn't just meant to orient the reader between CloudFront and Fastly, but also against the third option of not using a CDN at all. In fact, there are many concerns that may be glossed over but that you need to take seriously when you move your web presence from a few origin servers to a global network of shared, locked down servers managed by an external organization.<br />
<h3>
Infrastructure as Code</h3>
Our AWS-based setup makes heavy use of <a href="https://aws.amazon.com/cloudformation/">CloudFormation</a>, the native service for declaratively specifying resources such as servers, load balancers and disks. The simple setup has been augmented over the years by a code generation layer for the CloudFormation templates; this Python code reduces duplication between the various templates by starting from standard EC2/ELB/EBS resources that can be customized in size and other parameters.<br />
If we start from a simple single-server setup for a microservice (this was before Docker containers got stable enough), we are looking at a template containing at least an EC2 instance and a DNS entry pointing to it. With multiple servers, we expand this with a load balancer that pulls in a TLS certificate provided to IAM by an administrator.<br />
To configure CloudFront via CloudFormation, an additional resource for the CDN distribution is introduced. All the configuration you need will be visible in this resource, a JSON dictionary or XML tag respecting a certain schema.<br />
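Much abbreviated, and with hypothetical names, such a resource might look like this in YAML-flavoured CloudFormation (schema details vary, so treat this as a sketch):

```yaml
Resources:
  Cdn:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        Enabled: true
        Origins:
          - Id: app
            DomainName: origin.example.com
            CustomOriginConfig:
              OriginProtocolPolicy: https-only
        DefaultCacheBehavior:
          TargetOriginId: app
          ViewerProtocolPolicy: redirect-to-https
          ForwardedValues:
            QueryString: false
```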
Since CloudFormation can only manage AWS resources and nothing outside that tended garden, Fastly was the reason for introducing <a href="https://www.terraform.io/">Terraform</a> alongside it. Whereas almost anything AWS-specific still goes through CloudFormation, Terraform has opened up new roads such as Infrastructure as Code implementations for Google Cloud Platform (storage buckets and BigQuery tables).<br />
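The Fastly equivalent in Terraform is a sketch like the following, using the 2018-era <i>fastly_service_v1</i> resource with hypothetical names:

```hcl
resource "fastly_service_v1" "cdn" {
  name = "green-widgets--prod"

  domain {
    name = "www.example.com"
  }

  backend {
    name    = "origin"
    address = "origin.example.com"
    port    = 443
  }
}
```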
Applying changes in this context is not trivial as you may inadvertently reboot or destroy a server while believing you were only changing a minor setting. Yet Infrastructure as Code is about making the current state of infrastructure and <b>all changes visible, easy to review and safe to rollout across multiple environments</b>. It is imperative therefore to maintain testing environments created with the same tooling as production, and to use them to ultimately integration test all changes.<br />
The caveat of using multiple tools in lockstep for the same instance of a project (including servers, cloud resources and CDNs) is that they <b>can't declare dependencies between resources managed by different tools</b>. For example, since we manage DNS in CloudFormation and Fastly CDNs in Terraform, we can manage both at the same time but can't couple the existence of a DNS entry to the CDN it points to, or impose a creation or update order that is different from the general order we run the tools in.<br />
The most glaring difference in updates rollout between the various options is that, to rollout a CDN configuration change, it takes:<br />
<ul>
<li>no deployment time if you don't use a CDN (obviously)</li>
<li>tens of seconds for Fastly</li>
<li>tens of minutes (up to 1 hour was common) for CloudFront</li>
</ul>
This means Fastly opens up the possibility for experimentation, even if with slower feedback than your local TDD cycle. With CloudFront this is painful and haphazard as you decide on a change, start applying it and come back one hour later to check its effects, after having already switched to another task.<br />
Still, <b>minutes</b> of update and/or creation time make Fastly unsuitable for inclusion in the CI environments where the tests of a single service are run. You could in theory create a Fastly service on the fly when the build of the service runs, but this will add minutes to your build <i>and</i> promote coupling to the CDN itself. Fast forward this a bit and <b>you'll see an application unable to be run locally anymore</b> for exploration because of the missing CDN layer. Therefore, like other cloud services, the CDN is treated as a long-lived resource, with its regression testing performed in a shared environment on every new application commit, but after merge.<br />
<h3>
Logging</h3>
Within a web service, you usually have some kind of access log being generated by nginx or Apache. These logs can sit on a single server or can be uploaded to some aggregation point, whether it is a local Logstash or an external platform that can index them.<br />
Even load balancing doesn't change this picture very much as the load balancer(s) logs should be identical to the ones of the application servers if everything is working well. But with a CDN, large-scale caching is introduced and so it's plausible that you will stop directly seeing a large percentage of your traffic. Statistics or monitoring based on access logs may get skewed; or worse, Japan may be cut off from your website for a while because the health checks from the CDN points of presence there have a timeout of a few milliseconds too low to get to your servers in us-east-1 (of course this never happened).<br />
Hence, to understand what's going on in those few hundred servers you have no access to, you need a way to stream their logs to some outsourced service; this can be storage as a service (S3 or GCS) or directly some log infrastructure provider. The latency with which logs get to the right place is a key metric of the feedback loop from changes.<br />
Since we are striving for Infrastructure as Code, <b>all the logging configuration should be kept under version control</b> together with hostnames and caching policies. We got to a standard logging format (JSON Lines with certain fields) and frequency, along with a GCS bucket where new entries are put, with bucket names following conventions. This was later expanded into BigQuery tables providing queries over the same data, after the Terraform Fastly provider started supporting this delivery mechanism.<br />
The main difficulty in integration was <b>credentials management</b>: you aren't told much if credentials are not correct or not authorized to perform certain actions like writing to BigQuery. Moreover, you can't just commit a bunch of private keys for anyone to see, especially since Infrastructure as Code repositories tend to be made very visible to as many people as possible.<br />
We ended up putting GCP credentials and similar secrets in <a href="https://vault.io/">Vault</a>, running on the same server as the Salt master (same thing as Puppet master). The GCP Service Account itself and its permissions to write to the bucket needed some special permissions to set up (it's turtles all the way down), so we couldn't put it directly into Infrastructure as Code and had an admin create it manually instead. The ideal thing would be for Vault to generate credentials by itself, following the pattern of periodically rotating them. But then it would need to push these credentials somehow into the Fastly configuration and I'm here to provide efficient delivery pipelines, not make cloud giants wrestle.<br />
<h3>
Flexibility</h3>
Your own application is usually highly customizable, with a certain cost associated. You have to write some code in your favorite programming language, possibly following some framework conventions and calling your classes <i>Middleware</i> or <i>EventListener</i>.<br />
CDNs work on shared servers, so they have limits on what can be safely run in that sandboxed environment. Nevertheless, Fastly provides the possibility to customize the <a href="https://docs.fastly.com/vcl/">VCL</a> that runs each service with your own snippets and macros.<br />
This is very flexible, perhaps even too much: you can introduce headers with random values, write conditionals and implement loops by restarting requests. It feels similar to working in nginx configurations but with a more predictable language.<br />
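A hypothetical VCL snippet of the kind that can be attached to a Fastly service, which would sit in the <i>vcl_recv</i> subroutine:

```vcl
# tag article pages so later logic (caching, logging) can branch on the header
if (req.url ~ "^/articles/") {
  set req.http.X-Section = "articles";
}
```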
The main problem with this form of customization is that <b>there is no way to run it or test it on your own</b>. The best feedback loop we found is the <a href="https://fiddle.fastlydemo.net/">Fastly Fiddle</a> (similar to JS Fiddle) where you test out bits of code, hit a save button and see it propagated to servers around the world for you to test.<br />
The fact that this even exists is impressive, but you can imagine how well it works for actual development. Once you get past experimenting, you can't integrate a Fiddle with your own Infrastructure as Code approach (e.g. Terraform templates) nor easily port code from one to another besides copying and pasting. You can run integration-only tests in some other window, but the feedback loop can't be shorter than the deployment time; unit tests are not a thing. You can't even use your IDE as much as you may love it. In the end, <a href="https://www.fastly.com/blog/benefits-using-varnish">Fastly's Varnish diverged from the open source one 4 major versions ago</a>; hence, this VCL is a proprietary language and you'll feel the same as writing stored procedures in Oracle's PL/SQL.<br />
I tend to see VCL and other intermediate declarative templates (such as Terraform .tf files) as <a href="https://github.com/elifesciences/builder/blob/master/src/buildercore/fastly.py">a generation target for Infrastructure as Code to compile to</a>. This lets you unit test that your tools generate a certain output for these templates; use dummy inputs in tests and check dummy expected outputs; all of this will still need to be integration tested with the application itself in a real environment, but some of the responsibilities can be developed in the tool itself and reused across many applications.<br />
<h3>
Integration testing</h3>
We have understood by now that to keep the ensemble of servers, code, cloud services and CDNs working we <b>need some automated integration testing in place that touches all the different pieces</b>. We don't want many scenarios to be tested at this level because it's slow and brittle to do so, but we need a tracer bullet that goes through <i>everything</i>, if only to verify all configurations are correct.<br />
In the general context of outsourcing of responsibilities to a service or a library, you still own it as a dependency of your application and still need to verify the emergent behavior of custom code and borrowed architecture.<br />
Therefore, I always put at least a staging environment in place replicating production where <i>automated</i> tests can run. This doubles as the place to try and roll out infrastructure updates that are risky (which are risky? If you have to ask, all of them; just roll out everything through staging).<br />
As we have seen, creating too many different, ad-hoc environments to test pull requests doesn't scale; this will reach death by feature branch as all of your Jenkins nodes are waiting for yet one more RDS node or CloudFront distribution to be created.<br />
A common example of a coupled, integration-related feature to test is the forwarding of <i>Host</i> and other headers; these go through so many layers: a couple of CDN servers, a load balancer, an nginx daemon and finally the application. Some headers don't just have to be forwarded, but have to be rewritten or renamed or added (<i>X-Forwarded-For</i>). All of this can in theory be specified for every single layer but testing the whole architecture probably makes for easier long-term maintenance.<br />
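As a toy model of that header-forwarding path (the layers and addresses here are invented for illustration, not a real CDN configuration), each layer can be expressed as a function over a header dictionary, and one tracer-bullet test checks the emergent behavior of the whole chain rather than each layer in isolation:

```python
def cdn(headers):
    # The CDN terminates the client connection and records the client IP.
    out = dict(headers)
    out["X-Forwarded-For"] = out.get("Client-IP", "203.0.113.7")
    return out

def load_balancer(headers):
    # The load balancer appends its own address to X-Forwarded-For.
    out = dict(headers)
    out["X-Forwarded-For"] = out.get("X-Forwarded-For", "") + ", 10.0.0.1"
    return out

def nginx(headers):
    # nginx forwards Host untouched so the application can build absolute URLs.
    return dict(headers)

def through_all_layers(headers):
    # The tracer bullet: one request passes through every layer in order.
    for layer in (cdn, load_balancer, nginx):
        headers = layer(headers)
    return headers

final = through_all_layers({"Host": "example.com", "Client-IP": "203.0.113.7"})
assert final["Host"] == "example.com"                      # forwarded intact
assert final["X-Forwarded-For"].startswith("203.0.113.7")  # client IP preserved
```

In a real environment the same assertions would run as an HTTP request against staging, with an echo endpoint reporting the headers the application received.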
<h3>
Why?</h3>
In various projects you always have to ask yourself why you are doing something (especially complex things) and what value you want to get out of it. CDNs are one of the go-to solutions for web performance, their killer feature being <b>huge caches for slow-changing HTML and assets across the world</b>, so that even a casual Indian reader can load your homepage in one second. Moreover, if done right, the <b>load on your origin servers will also be greatly reduced</b> with respect to not using caching layers.<br />
On the other hand, you can see the <b>complexity, observability and maintenance needs</b> that every additional layer introduces. When asking whether a CDN or your application should do something, it's the same decision as for a database or a cloud service: how can you effectively store and update its configuration in multiple environments? Do you want to outsource that responsibility? How will you know when something's wrong? Do you feel comfortable writing stored procedures in a language you can't run on your laptop? All of these are architectural questions to go through when evaluating various CDNs, or no CDN.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-31729408828004620562018-12-06T17:01:00.000+01:002018-12-06T17:01:06.328+01:00Book review: The 5 Dysfunctions of a team<div class="separator" style="clear: both; text-align: left;">
<a href="https://en.wikipedia.org/wiki/The_Five_Dysfunctions_of_a_Team" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="https://en.wikipedia.org/wiki/The_Five_Dysfunctions_of_a_Team" border="0" data-original-height="499" data-original-width="330" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBdy6OXuDAse9SlWn5LlJAw_tVMyQR66XgAxrqhhxHdv3sF6qjh5DsLKe_rb7qD_wYdfSM9QUx69OQHjp8vas-HzLDHQ2otRTvDSfzo2WZxFHzn3L8i1O6wipdWlp8l3fJkDI-Lg/s320/51QArmq8raL._SX328_BO1%252C204%252C203%252C200_.jpg" width="211" /></a></div>
This is a spur-of-the-moment review of <a href="https://en.wikipedia.org/wiki/The_Five_Dysfunctions_of_a_Team">The 5 Dysfunctions of a Team</a>, a business novel on team health that I've read today as part of the quarterly Professional Development Days I take as part of working at <a href="https://twitter.com/elife">eLife</a>.<br />
<br />
As a follow-up to my role evolution into <a href="https://www.giorgiosironi.com/2018/01/new-role-software-engineer-in-tools-and.html"><i>Software Engineer in Tools and Infrastructure</i></a>, I am looking again more into the people skills side of my job (as opposed to purely technical skills). I have done this cyclically during my career, as the coder hat becomes too restrictive and you have to pick up other tools to achieve improvement. In particular, I am working on eLife's Continuous Delivery platform, and it is crucial to work with multiple product-oriented teams to have them adopt your latest Jenkins pipelines and GitHub reports.<br />
<h3>
Dysfunctions</h3>
Patrick Lencioni's model of team dysfunctions (or of blessed behaviors, if you flip all definitions) is a pyramid where each dysfunction prevents the next level from being reached. It would be a disservice to how well and quickly the novel gets this across to just list them here, but if I had to summarize it in a long paragraph, it would look like:<br />
<blockquote class="tr_bq">
Building <b>trust</b> between team members allows <b>constructive conflicts</b>, which enable people to <b>commit</b> to action and hold each other <b>accountable</b> for what has been decided; all in the service of <b>results</b>. -- not really a quote</blockquote>
The dysfunctions are the flip side of these positive behaviors, for example <i>lack of trust</i> or <i>fear of conflict</i>. The definitions of some of these terms are more precise than what you find in many Agile and business coaching books; so don't dismiss <i>trust</i> as just a buzzword, for example.<br />
<h3>
Some context</h3>
The case being treated in the book, which comes from the author's management consulting firm, is that of a CEO turning around a team of executives. This makes for a somewhat more fascinating view of results as it's talking about a (fictional) company's IPO or eventual bankruptcy. Despite not being a clear parallel to a software development team, I do think this is applicable in every situation <b>where professionals are paid to work daily together</b>, with some caveats.<br />
In fact, I suspect the level of commitment to the job that you see in the book would be typical of either high stakes roles (executives) or a generally healthy organization that has already removed common dysfunctions at the individual level. If in your organization:<br />
<ul>
<li>people are <i>primarily</i> motivated by money</li>
<li>they look forward to 5 PM</li>
<li>they browse Facebook and Twitter for hours each day</li>
</ul>
then there are personal motives that have to be addressed before teams can start thinking about collective health. <br />
<h3>
Yes but, what can I do in practice?</h3>
After the narrative part, an addendum to the book contains a <b>self-administered test</b> to zoom in on which possible dysfunctions your team may exhibit at the moment. It continues with a series of <b>exercises and practices</b> that address these topics, with an estimate of their time commitments and of how difficult they would be to run. I definitely look forward to anonymously trying out the test with my technical team, out of curiosity for other views.<br />
<h3>
My conclusions</h3>
I think most of the dysfunctions are real patterns, which can only be exacerbated by the currently distorted market for software developers and CV-driven development. The last dysfunction, <i>Inattention to results</i>, is worth many books of its own on how to define those results, as employees at all levels are known to optimize around measurable goals to the detriment of, for example, long-term maintenance and quality.<br />
So don't start a crusade armed with this little book, but definitely keep this model in your toolbox and share it with your team to see if you can all identify areas for collective improvement; it is painfully obvious to say that <b>you can't work on this alone!</b><br />
The author is certainly right when writing that groups of people truly working together can accomplish what any assembly of single individuals could never dream of doing.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-17495907890926029522018-09-16T10:59:00.000+02:002018-09-16T10:59:14.393+02:00Eris 0.11.0 is out<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj79ECIoIL2rLwVDnea5ylasGxOzmnvWLktbzsQeqKjQdtrNbxwrmk-Bx6jWRDrAtxTiVz7VuhDliBVyAzFk6gpIXwC29DXivNrB0ekpbbty6crXM2qayaK8a-_YKru3UVYbJFT4A/s1600/Artist%2527s_impression_dwarf_planet_Eris.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1066" data-original-width="1600" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj79ECIoIL2rLwVDnea5ylasGxOzmnvWLktbzsQeqKjQdtrNbxwrmk-Bx6jWRDrAtxTiVz7VuhDliBVyAzFk6gpIXwC29DXivNrB0ekpbbty6crXM2qayaK8a-_YKru3UVYbJFT4A/s320/Artist%2527s_impression_dwarf_planet_Eris.jpg" width="320" /></a></div>
<a href="http://www.giorgiosironi.com/search/label/eris">Eris</a> 0.11.0 has been freshly released, and I'll be listing here various contributions that the project has received that are included in this new version and in the previous one, 0.10.0, which didn't have an associated blog post.<br />
<br />
For a full list and links to the relevant pull requests and commits, see the <a href="https://github.com/giorgiosironi/eris/blob/master/ChangeLog.md">ChangeLog</a>.<br />
<h3>
0.10</h3>
<ul>
<li>The <i>Eris\Facade</i> class was introduced to allow usage outside of a PHPUnit context.</li>
<li>Official PHPUnit 7 support was introduced.</li>
<li>Fixed a corner case in <i>suchThat()</i></li>
</ul>
There are some small backward compatibility breaks with respect to 0.9; they regard features that were unused (or so I thought), including <i>Generator::contains()</i>.<br />
<h3>
0.11</h3>
<ul>
<li><span style="font-weight: normal;">Official PHP 7.2 support</span></li>
<li><span style="font-weight: normal;">Annotations support for configuring behavior that is usually configured through methods: <i>@eris-method</i>, <i>@eris-shrink</i>, <i>@eris-ratio</i>, <i>@eris-repeat</i>, <i>@eris-duration</i> </span></li>
</ul>
<h3>
Some acknowledgements</h3>
Most of this work comes from <b><a href="https://github.com/giorgiosironi/eris/blob/master/CONTRIBUTORS.md">contributions</a></b>, not from me. I'd like to say a word of thanks to the people that have taken the time to use Eris in some of their projects but also to feed back a fix, an extension, or a substantial improvement.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-39566691947116443142018-01-21T13:41:00.005+01:002018-01-21T13:41:51.615+01:00Book review: Production-ready microservices<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.amazon.co.uk/Production-Ready-Microservices-Standardized-Engineering-Organization/dp/1491965975" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="https://www.amazon.co.uk/Production-Ready-Microservices-Standardized-Engineering-Organization/dp/1491965975" border="0" data-original-height="499" data-original-width="381" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNeP8P9duVGpbaTNu0FZWjKQSCX7y1n7ThKrOHXdULTygvtmrFgljtLCuay8eKdv51iBmbqgV0c62dllT2wALzVBN2yl33ihqgwiLERLfHjVCnCFxG5VZeqpsGJo5KEuIPje_Kdg/s320/production-ready-microservices.jpg" width="244" /></a></div>
<a href="https://www.amazon.co.uk/Production-Ready-Microservices-Standardized-Engineering-Organization/dp/1491965975">Production-Ready Microservices</a> is a short book about consistently practicing architecture and design over a fleet of microservices.<br />
In general, I think the principles described here apply very much to any service-oriented initiative, even more so if the services are coarse grained and hence require more maintenance than finely isolated ones.<br />
<h3>
Uber </h3>
The book extrapolates from the author's experience at Uber "standardizing over a thousand microservices". Given a few developers for each microservice team, that makes up 2000-3000 engineers from the total >10000 Uber employees (I wonder how many are lawyers). After WhatsApp's famous story of being acquired at 55 employees in total, that really highlights <b>the difficulty level of running a business and operations all over the physical world</b> (sending cars and drivers around in dozens of countries) with respect to a digital-only enterprise. We should remember this and many other directions of change the next time we hear a technology advocate saying how much the cost of his 2-people startup has been reduced by $technology.<br />
<br />
<h3>
The main message</h3>
You <b>should be <a href="https://martinfowler.com/bliki/MicroservicePrerequisites.html">this tall to use microservices</a></b>; this architecture doesn't necessarily fit every context, although integrating separate services of some size is becoming a standard after the API revolution (before that, it was <i>integrate through the database</i>, which is arguably worse).<br />
You will encounter many different social and technical problems, such as:<br />
<ul>
<li><b>Inverse Conway's Law</b>, with the shape of the products defining the shape of the company. Although I found this doesn't really apply at smaller scales, as development teams can own more than one service and experience a successful decoupling between people and code.</li>
<li><b>Technical <a href="https://en.wikipedia.org/wiki/Urban_sprawl">sprawl</a></b>, where multiple languages, databases and other key choices spread without a consistent, central planning.</li>
<li><b>More ways to fail</b>: distributed and concurrent systems are more difficult to work with and to reason upon, plus the fact that there are more servers, containers or applications will simply multiply the failures you'll see.</li>
</ul>
There are lots of non-functional requirements, like scaling each microservice and isolating it from the rest of the fleet; perhaps don't go too micro- if you don't have the resources to ensure an acceptable level in each service. Perhaps in your context the acceptable SLA for some particular service is low, because it does not change often, is only internally facing, or is only used a few times per day.<br />
One particular aspect of the <i>Consistency is important</i> lesson is that <b>the whole lifecycle of services should be considered</b>. Maintenance and even decommissioning are as important as producing new MVPs: but I've seen many times services being neglected, or being considered very easy to migrate away from once some new shiny substitute was available. In reality, it takes time and effort to keep services up and running, and to finally kill them when you have an alternative, as data and users are slowly migrated off the old platform onto the new one.<br />
Lots of requirements are also overlooked but often turn out to be important as you increase your population of services: the scalability of a single endpoint, fault tolerance, even documentation (<a href="http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions">ADRs</a> are the only form I really trust right now in a fast-moving organization). Every single section of this book will make you think about these, but won't give much of an overview: you're better served by reading the <a href="http://www.giorgiosironi.com/2017/03/book-review-site-reliability-engineering.html">SRE book</a>, for example.<br />
<h3>
Value for money</h3>
This book is a short read which gives you an overview of what microservices challenges you're likely to face down that rabbit hole; in particular, it focuses on a medium-to-large organization context. I'm <b>not sure this book is worth the price tag</b> however: 20 pounds for a Kindle edition of ~170 pages, where ~25 pages are glossary, index and lots of checklists.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-6681824340790826702018-01-03T12:53:00.000+01:002018-01-03T13:01:08.766+01:00Book review: Algorithms to live by<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8UyhPmTgx5ioWRV_dTb7NsrMpOtlD0OV-XC_bUsOENwn8GofekzHSdPRx57GdvslV9zyfO6p5JoTn7qN15ksOo0h3Lzho4lxCQ1pVEGdr2Pin1blEL-LEb9xUvazz6C7mZW-33A/s1600/algorithms.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="https://www.amazon.com/Algorithms-Live-Computer-Science-Decisions/dp/1627790365" border="0" data-original-height="499" data-original-width="327" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8UyhPmTgx5ioWRV_dTb7NsrMpOtlD0OV-XC_bUsOENwn8GofekzHSdPRx57GdvslV9zyfO6p5JoTn7qN15ksOo0h3Lzho4lxCQ1pVEGdr2Pin1blEL-LEb9xUvazz6C7mZW-33A/s200/algorithms.jpg" width="130" /></a>
<a href="https://www.amazon.com/Algorithms-Live-Computer-Science-Decisions/dp/1627790365">Algorithms to Live By: The Computer Science of Human Decisions</a> is a book that puts together the domains of computer science and real life. The range of topics it touches is wide. The book treats deterministic algorithms such as optimal sorting, but then moves on to more context-dependent strategies for caching and scheduling. The last chapters even get to model identification, (tractable and intractable-made-tractable) optimization problems, stochastic algorithms and game theory.<br />
<br />
All the while, computer science concepts are compared to conscious and unconscious human processes.
For example, caching and the memory hierarchy have <b>great parallels with how the human brain</b> recollects memories of recent events, and how we can augment our brain with external, slower supports like paper. Scheduling is useful not only to allocate processes on CPU cores, but also to make an explicit choice of strategy when prioritizing the tasks that you or your team face. Up to the more extreme examples of game theory and mechanism design, when the incentive system becomes more important than the individual agents (<a href="https://www.getbeyond.com/blog/manage-system-not-people/">manage the system, not the people</a> rings a bell?)<br />
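The caching parallel centers on eviction policies such as Least Recently Used, which the book discusses as a good default for computers and a fair model of how memory favors recent items; a minimal sketch of LRU in code:

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least recently used entry once capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # a hit makes the entry "recent" again
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used
cache.put("c", 3)  # evicts "b", the least recently used
assert cache.get("b") is None
assert cache.get("a") == 1
```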
<br />
If you like viewing the world through the lens of algorithms and seeing how the strategies of humans and computers compare with each other, I would strongly recommend this book, as it makes for an entertaining read with some principles to take away for real-life usage (I hope sorting socks will be easier now). Skip it if you have a very wide knowledge of computer science, operations research, Nash equilibria... but even though I was familiar with the technical part, I was missing <b>the connection to different domains or everyday, real world problems.</b>
I listened to the audiobook version, which lasts about 12 hours. You may find it easier to skim through some chapters if you are more (or less) interested in some topics. The problem with audiobooks is that I can't easily take notes, while highlighting on an e-book reader is quick and lets me recollect all important gotchas later into a text file.
Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-28357828567243057632018-01-03T09:36:00.001+01:002018-12-06T17:01:27.317+01:00New role: Software Engineer in Tools and InfrastructureAfter working on eLife's testing and deployment infrastructure in 2016, in the last year my responsibilities in the technical team have shifted towards the domain of engineering productivity. Testing is one phase of the development process that is often a bottleneck, but there are many more areas like code reviews, monitoring and infrastructure itself (be it servers or services):<br />
<blockquote class="tr_bq">
In summary, the work done by the SETs naturally progressed from
supporting only product testing efforts to include supporting product
development efforts as well. Their role now encompassed a much broader
Engineering Productivity agenda. -- <i><a href="https://testing.googleblog.com/2016/03/from-qa-to-engineering-productivity.html">Ari Shamash on the Google Testing Blog</a></i></blockquote>
Moreover, the team starts from a high level of coverage and design on many projects, to the point that my focus has always been on the provisioning and automation of testing environments, and on large-scale end2end testing.<br />
<br />
What seems just a letter on a job title (from SET to SETI) is in fact an alignment of responsibilities so that I am not accidentally mistaken for "the QA guy" but always seen as a problem solver instead.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://en.wikipedia.org/wiki/Pulp_Fiction" style="margin-left: auto; margin-right: auto;"><img alt="https://en.wikipedia.org/wiki/Pulp_Fiction" border="0" data-original-height="400" data-original-width="400" height="320" src="https://memegenerator.net/img/instances/500x/81016744/im-winston-wolf-your-seti.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Solving problems and propagating the solution, so that you don't have to solve them over and over again</td></tr>
</tbody></table>
Roles are always an approximation in a team of <a href="https://blog.codinghorror.com/swiss-army-knife-or-generalizing-specialist/">generalizing specialists</a> that also distribute and collaborate on some roles, such as architecture. But it's helpful in a cross-functional team to have someone dedicated to the task of productivity, whether it is reached through automation, tooling, or continuous improvement.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-74566281235194590372017-11-30T22:33:00.003+01:002017-12-01T11:25:36.687+01:00Book review: Building Microservices<div class="separator" style="clear: both; text-align: center;">
<a href="http://shop.oreilly.com/product/0636920033158.do" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="http://shop.oreilly.com/product/0636920033158.do" border="0" data-original-height="656" data-original-width="500" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIkAorOCSsJaJdwmWbB60-OUVuemf1d7X3mO5lc2lCipISAVbvZW2GVRESkIq_5wajHPK9-VBKTfefQUCuqjz2QDHGgtZbzvLIbIsZ2Nz5VB611Ex9kA9BldwkZbgUw4HVYAmvqQ/s320/microservices.jpg" width="243" /></a></div>
<a href="http://shop.oreilly.com/product/0636920033158.do">Building Microservices: Designing Fine-Grained Systems</a> by Sam Newman is the seminal book on microservices as a concept. It was published at the start of 2015 (that's a long time ago in tech... or is it?), it's focused on high-level topics rather than implementation and hence has aged well.<br />
<br />
There are in fact several concepts, both at the methodology and at the technical level, that the book does justice to. Here's what turned on the light bulb over my head.<br />
<h3>
Modeling</h3>
Modeling is important, and a whiteboard discussion can save weeks of implementation down the line. Neither Sam Newman nor I am the first to say this. Modeling in microservices, like in Domain-Driven Design, is all built around business capabilities and the shape of your organization (yes, Conway).<br />
<br />
<h3>
Styles of integration</h3>
Like for other design choices, especially at the architectural level, it's important to explicitly choose whether to go for a shared database (please don't), synchronous or asynchronous communication; orchestration through a Facade, or choreography distributing responsibilities between services; explicit versioning and the kinds of backward compatibility it allows; pushing or pulling data from one physical location to another, and with which granularity of time and entity.<br />
There are also styles of isolation, not just of integration: code reuse is maturely described on a trade-off scale with decoupling. It feels like a pattern book in which these options are given a standard name for further discussion, and evaluated with respect to the contexts in which they work well.<br />
<br />
<h3>
Deployment</h3>
Should you go for virtual machines or containers? How do you map services to physical or virtual machines? The book couldn't possibly keep up with the rise of container orchestrators in the last couple of years, so it won't be a complete guide, but it can give you a sense of the problems that virtual machines create and that we are going to solve in this next generation. What problems containers create, and most of all how to solve them, is instead outside the scope of this book.<br />
<h3>
Testing</h3>
After the basics like a testing pyramid, I don't find myself in complete agreement with the large-scale testing strategies proposed here, like consumer-driven contract testing. Yes, it works well enough if you can specify a formal contract that a service should adhere to, and test it in isolation in the implementing service. But the overhead of doing so, in a context in which we are supposed to create dozens if not hundreds of services, is very significant.<br />
At <a href="https://elifesciences.org/">eLife</a> we have relied on a <a href="https://github.com/elifesciences/api-raml">wide RESTful API specification</a>, each group of endpoints implemented by a different service. As such, the overhead is limited, and this is just a description for validating requests and responses rather than a full contract, as most of these services are read-only.<br />
All in all, <b>I find myself relying on the end2end beast</b> to get heterogeneous services, written in multiple languages, at different times by different people, to talk together reliably. Trying to square the circle of contracts would cost me a lot of energy, but I suppose they work well at a higher scale of traffic or on selected services.<br />
The <a href="https://github.com/elifesciences/elife-spectrum/">end2end tests</a> we use are limited in their scope, constrained by being at the top of the pyramid; they do not necessarily cover a full end2end scenario, but rather a data path involving more than one service, often skipping the user interface. It helps that we have no Selenium-based testing <i>in the end2end layer</i>, as the user interface is fully accessible to an HTML parser and requires no Javascript.<br />
The problems that we encounter daily happen in production all the same: timeouts, dirty data, the automation challenge of turning on and off new nodes reliably, the race conditions that come from distributed executions. I'd rather not hide these problems but solve them, and I'm looking at containers instead to try to shrink the big picture and have a simpler end2end environment, easier to spin up and down, or to provide with clean databases.<br />
<h3>
Hidden gems</h3>
There are several hidden gems that would let you pick the brain of the author on a common problem cited in a chapter. For example, we have such a problem in integrating with <a href="https://civicrm.org/">a CRM that doesn't even support PHP 7</a> (what is it with CRMs and always being sources of technical debt?) There are some example patterns that you could apply in that situation, like hiding the CRM behind a specialized service that cleans up its API. Nothing miraculous, but a glimmer of hope for these desperate situations.<br />
<h3>
Conclusions</h3>
If you are going to work with microservices, or milliservices, or small enough services, this book is worth a read. If you are only having troubles with a particular area, such as testing or security, going through a single chapter will give you a big picture before you go in depth with further sources.<br />
Remember that this book is starting to become a bit dated, so you cannot take highly technical lessons from it (and I doubt books are a great tool for those in general in this fast-moving environment). Think about your context, learn the theory, and fill in the parts in which the map is blank (or erased) with what you are learning from the web in 2017 (soon to be 2018 - attempt to make this review valid for one more year).<br />
<br />Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com2tag:blogger.com,1999:blog-36547168.post-36743350319825547622017-04-03T23:22:00.001+02:002017-04-03T23:27:46.721+02:00Pipeline Conf 2017<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAO3Z-zNGOzxRY2d9U7Ln0-3xa0Tg9vpMRZMRF9qNDVIbXqPoiGgbopfJ2gkL6KNQ-3iVPhpZKZlFpgAJ1GXibTELnBDkDAZ9fKW4YIy8v07EAM8xg7l3l3Ci-r3EUIWQg08-oUA/s1600/pipeline.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAO3Z-zNGOzxRY2d9U7Ln0-3xa0Tg9vpMRZMRF9qNDVIbXqPoiGgbopfJ2gkL6KNQ-3iVPhpZKZlFpgAJ1GXibTELnBDkDAZ9fKW4YIy8v07EAM8xg7l3l3Ci-r3EUIWQg08-oUA/s200/pipeline.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Not liquorice</td></tr>
</tbody></table>
<i>Post originally shared on <a href="https://elifesciences.org/">eLife</a>'s internal blog, but in the spirit of (green) <a href="https://en.wikipedia.org/wiki/Open_access">open access</a> here it is.</i><br />
<br />
Last month I have attended the <a href="https://web.pipelineconf.info/">PIPELINE conference</a> in London, 2017 edition. This event is a not-for-profit day dedicated to <a href="https://continuousdelivery.com/">Continuous Delivery</a>, the ability to get software changes into the hands of users, and to do so safely, quickly, and in a sustainable way. It is run by practitioners for practitioners, everyone on different sides of the spectrum like development, operations, testing, project management, or coaching. <br />
<br />
The day is run with parallel tracks, divided into time slots of 40-minute talks and breaks for discussions and, of course, some sponsor pitches. I have been picking talks from the various tracks depending on their utility to eLife's testing and deployment platform, since our tech team has been developing every new project with this approach for a good part of 2016.<br />
<br />
The conceptual model of Continuous Delivery and of eLife's implementation of it is not dissimilar to the scientific publishing process:<br />
<ul>
<li>there is some work performed into an isolated environment, such as a laboratory, but also someone's laptop;</li>
<li>which leads to a transferable piece of knowledge, such as a manuscript, but also a series of commits, roughly speaking some lines of code;</li>
<li>which is then submitted and peer reviewed. We do so through pull requests, which perform a series of automated tests to aid human reviewers inside the team; part of the review is also running the code to reproduce the same results on a machine which is not that original laptop.</li>
<li>after zero or more rounds of revisions, this work gets accepted and published...</li>
<li>which means integrating it with the rest of human knowledge, typesetting it, organizing citations and lots of metadata about the newly published paper. In software, the code has to be transformed into an efficient representation, or virtual machines have to be configured to work with it.</li>
<li>until, finally, this new knowledge (or feature) is in the hands of a real person, who can read a paper or enjoy the new search functionalities</li>
</ul>
Forgive me for the raw description of scientific work.<br />
<br />
In software, Continuous Delivery tries to automate and simplify this process so that it can be performed on microchanges multiple times per day. It aims for speed, to be able to bring a new feature live in tens of minutes; it aims for safety, to avoid breaking the users' work with new changes; and it does all of this in a sustainable way, not sacrificing tomorrow's ability to evolve for a quick gain today.<br />
<br />
Even without the last mile of real user traffic, the 2.0 software services have been running on production or production-like servers from the first weeks of their development. A common anti-pattern in software development is to say "It works on my machine" (imagine someone saying "It reproduces the results, but only with my microscope"); what we strive for is "It works on multiple machines that can be reliably created; if we break a feature, we know within minutes and can go back to the latest version known to work."<br />
<h3>
Dan North: opening keynote</h3>
Dan North started to experiment with Continuous Delivery in 2004, at a time when builds were taking 2 days and a half to run in a testing environment contended by multiple teams. He spoke about several concepts underpinning Continuous Delivery:<br />
<ul>
<li>conceptual consistency: the ability of different people to make similar decisions without coordination. It's a holy grail for scaling the efforts of an organization to more and more members and teams.</li>
<li>supportability: championing Mean Time To Repair over Mean Time Between Failures. The three important questions when facing a problem are: what happened? Who is impacted? How do we fix it?</li>
<li>operability: what does it feel like to build your software? To deploy it? To test it? To release it? To monitor it? To support it? Essentially, developer experience in addition to user experience.</li>
</ul>
Operability is a challenge we have to face ourselves more and more as we move from running our own platform to providing open source software for other people to use. Not only should reading an article be a beautiful experience: publishing one should be too.<br />
<h3>
John Clapham: team design for Continuous Delivery</h3>
This talk was more people-oriented; I agree with the speaker that the engagement of workers is what really drives profits (or value, in the case of non-profits).<br />
Practically speaking:<br />
<ul>
<li>reward the right behaviors to promote the process you want;</li>
<li>ignore your job title as everyone's job is to deliver value together; </li>
<li>think small: it's easier to do 100 things 1% better than to do 1 thing 100% better (aka <a href="http://jamesclear.com/marginal-gains">aggregation of marginal gains</a>)</li>
</ul>
<h3>
Abraham Marin: architectural patterns for a more efficient pipeline</h3>
The target for a build is to take less than 10 minutes. The speaker promotes the fastest builds as the ones you don't have to run, introducing a series of patterns (and the related architectural refactorings) that can be executed safely to simplify your software components:<br />
<ul>
<li>decoupling an API from its implementation: extracting an interface package reduces a dependency on a component to a dependency on an interface;</li>
<li>dividing responsibilities vertically or horizontally, trying to isolate the most frequent changes and minimize cross-cutting requirements;</li>
<li>transforming a library into a service;</li>
<li>transforming configuration into a service.</li>
</ul>
Some of these lessons are somewhat oriented to compiled languages, but not limited to them. My feeling is that even if you reduce compile times, you still have to test some components in integration, which is a large source of delay.<br />
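The first refactoring in the list, extracting an interface so that consumers no longer depend on a concrete implementation, can be sketched in a few lines. This is a minimal Python illustration with invented names, not an example from the talk:

```python
from typing import Protocol


class ConfigSource(Protocol):
    """The extracted interface: consumers depend on this small contract,
    not on the full implementation package."""

    def get(self, key: str) -> str: ...


class FileConfig:
    """One concrete implementation; it can change (or be replaced by a
    configuration service) without consumers needing a rebuild."""

    def __init__(self, values):
        self._values = values

    def get(self, key: str) -> str:
        return self._values[key]


def connection_string(config: ConfigSource) -> str:
    # this consumer depends only on the interface
    return "host=" + config.get("db_host")


print(connection_string(FileConfig({"db_host": "localhost"})))  # host=localhost
```

Changing `FileConfig` then only forces a rebuild of its own package, not of every consumer of `ConfigSource`, which is exactly what shrinks the build graph.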
<h3>
Steve Smith: measuring Continuous Delivery</h3>
How do you know whether a Continuous Delivery effort is going well? Or more pragmatically, which of your projects is in trouble?<br />
The abstract parameters to measure in pipelines are speed (throughput, cycle time) and stability. Each takes a different concrete form depending on the context.<br />
In deployment pipelines that go from a commit to a new version released in production, lead time and the interval between new deployments can be measured. But failure rate (how many runs fail) and failure recovery time are also interesting. In more general builds or test suites, execution time is a key parameter, but a more holistic view includes the interval (how frequently builds are executed). <br />
I liked some of these metrics so much that they are now in my OKRs for the new quarter. Simplistic quote: <a href="http://www.druckerinstitute.com/2013/07/measurement-myopia/">you can't manage what you can't measure</a>.<br />
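As a rough sketch of how these speed and stability metrics could be computed from a log of pipeline runs (the data shape here is invented for the example):

```python
from datetime import datetime, timedelta


def pipeline_metrics(runs):
    """runs: chronologically ordered dicts with 'started' and 'finished'
    datetimes and a 'success' bool. Returns speed (cycle time, interval)
    and stability (failure rate) measures."""
    cycle_times = sorted(r["finished"] - r["started"] for r in runs)
    intervals = [b["started"] - a["started"] for a, b in zip(runs, runs[1:])]
    return {
        "median_cycle_time": cycle_times[len(cycle_times) // 2],
        "mean_interval": sum(intervals, timedelta()) / len(intervals),
        "failure_rate": sum(1 for r in runs if not r["success"]) / len(runs),
    }


t0 = datetime(2016, 8, 1, 9, 0)
runs = [
    {"started": t0, "finished": t0 + timedelta(minutes=20), "success": True},
    {"started": t0 + timedelta(hours=2),
     "finished": t0 + timedelta(hours=2, minutes=25), "success": False},
    {"started": t0 + timedelta(hours=4),
     "finished": t0 + timedelta(hours=4, minutes=22), "success": True},
]
print(pipeline_metrics(runs))
```

Numbers like these are easy to collect from any CI server's build history, which is what makes them usable as objectives.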
<h3>
Alastair Smith: Test-driving your database</h3>
To continuously deploy new software versions, you need an iterative approach to evolve your database and the data within it. As the schema evolves, you also have to test every new schema change. Even in the context of stored procedures used for maximum efficiency (and lock-in), Alastair showed how to write tests that can reliably run in multiple environments.<br />
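Alastair's examples targeted stored procedures; as a much smaller illustration of the same idea, a schema can be expressed as an ordered list of migrations, applied to a throwaway database inside each test run (sqlite3 here, with invented table names):

```python
import sqlite3

# the schema is the ordered list of migrations ever applied to it
MIGRATIONS = [
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)",
    # a later iteration adds a column without rebuilding the table
    "ALTER TABLE articles ADD COLUMN published_at TEXT",
]


def migrate(conn):
    for statement in MIGRATIONS:
        conn.execute(statement)


def test_schema_supports_publication_dates():
    conn = sqlite3.connect(":memory:")  # a fresh, isolated database per run
    migrate(conn)
    conn.execute(
        "INSERT INTO articles (title, published_at) VALUES (?, ?)",
        ("A title", "2017-01-01"),
    )
    row = conn.execute("SELECT published_at FROM articles").fetchone()
    assert row == ("2017-01-01",)


test_schema_supports_publication_dates()
```

Because every test builds its database from the migration list, the same change that will run in production is exercised on every environment, which is the property the talk was after.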
<h3>
Rachel Laycock: closing keynote, Continuous Delivery at Scale</h3>
Rachel Laycock is the Head of technology for North America at Thoughtworks, the main sponsor of the conference. The keynote however had nothing to do with sales pitches. Here are some anti-patterns:<br />
<ul>
<li><i>"We have a DevOps team"</i> is an oxymoron, as that kind of team doesn't exist; what often happens is that the Ops team gets renamed.</li>
<li><i>"Do we choose Kubernetes or Mesos?"</i> as in getting excited about the technology before understanding the problem to solve.</li>
</ul>
The "at scale" in the title pushes for seeing automation as a way to build a self-service platform, where infrastructure people are not bottlenecks but enablers for the developers to build their own services.<br />
The best quote however really was <i>"yesterday's best practice becomes tomorrow's anti-pattern"</i>. What we look for is not to be the first to market but to have an adaptable advantage, a product that can evolve to meet new demands rather than being a dead end. Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-20012945763074952952017-03-28T22:59:00.002+02:002017-03-28T22:59:51.344+02:00Book review: Site Reliability Engineering<div class="separator" style="clear: both; text-align: left;">
<a href="http://shop.oreilly.com/product/0636920041528.do" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="http://shop.oreilly.com/product/0636920041528.do" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnKJDIuQdHSRO1m4OMp7u0Zcryb9OnfEwU8-9F-IDMgZMPtKqQMrargpVcyuxElo-H3iM-h6g7eP0cj34cCOT6MEKc9Daw1MkLBO4ULemYUe1HI3qb7IN_3J1mbedsPw_y54RtmA/s1600/sre.gif" /></a></div>
<br />
<blockquote class="tr_bq">
<i>The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?</i> -- the book tagline</blockquote>
<a href="http://shop.oreilly.com/product/0636920041528.do">Site Reliability Engineering - How Google runs production systems</a> is a 2016 book about the ops side of Google services, and the set of principles and practices that underlie it.<br /><br />A Site Reliability Engineer is a software engineer (in the developer sense) who designs and implements systems that automate what would otherwise be done manually by system administrators. As such, SREs have a directive to spend a <b>minimum of 50% of their time in development</b> rather than in firefighting and maintenance of existing servers (named <i>toil</i> in the book).<br /><br />The book really is a collection of chapters, so you don't have to be scared by its size, as you don't necessarily need to read it cover to cover. You can instead zoom in on the interesting chapters, be they monitoring, alerting, outage tracking or even management practices.<br />
<h3>
Principles, not just tools</h3>
I like books that reify and give names to concepts and principles, so that we can talk about those concepts and refer to them. This book gives precise definitions for the Google version of measures such as availability, Service Level Objectives and error budgets.<br />
This is abstraction at work: even if the examples being used show Google-specific tools like Borgmon or Outalator, the solutions are described at a higher abstraction level that makes them reusable. When load balancing practices are made generic enough to satisfy all of Google's services, you can bet that they are reusable enough to be somewhat applicable to your situation.<br />
<h3>
Caveat emptor</h3>
Chances are that <b>you're not Google</b>: the size of your software development effort is orders of magnitude smaller, and even when it is of a comparable size it's not easy to turn an established organization into Google (and probably neither desirable nor necessary.)<br />However, you can understand that Google and the other giants like Facebook and Amazon <i>are</i> shaping our industry through their economically irresistible software and services. Angular vs React is actually Google vs Facebook; containers vs serverless is actually Kubernetes vs Lambda, which is actually Google vs Amazon. The NoSQL revolution was probably started by the BigTable and Dynamo papers... and so on: when you deployed your first Docker container, Google engineers had already been using similar technology for 10 years; as such, they can teach you a lot at a relatively little cost through the pages of a book. And it's better to be informed on what may come to a cloud provider near you in the next years.<br />
<h3>
Conclusions</h3>
It took some time to get through this book, but it gives a realistic picture of running systems that handle large volumes of traffic and frequent changes at the same time. Besides the lessons you can directly get from it, I would recommend it to many system administrators and "devops guys" as a way to think more clearly about which forces and solutions are at play in their datacenters (or, more likely, virtual private clouds).Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-4952991671041539052017-03-12T20:13:00.002+01:002017-03-12T20:15:54.721+01:00Eris 0.9.0 is out<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiapjmw0yj8uAE8sdvprU9qmPRle6ao8Ql_7DuU13MZQPDNnAXUizoA0KHk2VHjLIORl5lqP3aZJMa3FYqe0zrUSIHnPKOpjVQf_u2YI9vJPeCnhDP7U34lUWKB1hEhnk24O4Na1A/s1600/Artist%2527s_impression_dwarf_planet_Eris.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiapjmw0yj8uAE8sdvprU9qmPRle6ao8Ql_7DuU13MZQPDNnAXUizoA0KHk2VHjLIORl5lqP3aZJMa3FYqe0zrUSIHnPKOpjVQf_u2YI9vJPeCnhDP7U34lUWKB1hEhnk24O4Na1A/s320/Artist%2527s_impression_dwarf_planet_Eris.jpg" width="320" /></a></div>
In 2016 <a href="http://www.giorgiosironi.com/2016/04/next-stop-cambridge.html">I moved to another country</a> and as a result of this change I didn't have much time to develop <a href="https://github.com/giorgiosironi/eris">Eris</a> further. Thankfully Eris 0.8 was already pretty much stable, and in this last period I could pick up development again.<br />
<h3>
What's new?</h3>
The <a href="https://github.com/giorgiosironi/eris/blob/master/ChangeLog.md">ChangeLog for 0.9</a> contains one big new feature, <b>multiple shrinking</b>. While minimization of failing test cases is usually performed with a single linear search, multiple shrinking features a series of different options for shrinking a value.<br />
For example, the integer <i>1234</i> was usually shrunk to <i>1233, 1232, 1231</i> and so on. With multiple shrinking, there is a series of options to explore that makes the search logarithmic, such as <i>617, 925, 1080, 1157, 1195, 1214, 1224, 1229, and 1231</i>. If the simplest failing value is below 617, for example, at least (1234-617) runs of the test will be skipped by this optimization in the first step alone.<br />
This feature is the equivalent of QuickCheck's (and other property-based testing libraries') <a href="https://hackage.haskell.org/package/QuickCheck-2.9.2/docs/Test-QuickCheck-Property.html#g:4">Rose Trees</a>, but implemented here with an object-oriented approach that makes use of `GeneratedValueSingle` and `GeneratedValueOptions` as part of a Composite pattern.<br />
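The bisection-style candidate generation described above can be sketched in a few lines. This is a rough Python illustration with an invented function name, not Eris's actual PHP implementation, so the exact candidates differ slightly due to rounding:

```python
def shrink_candidates(value, target=0):
    """Logarithmically spaced shrink candidates between the shrinking
    target and a known-failing value, simplest candidate first."""
    candidates = []
    low = target
    while low < value:
        mid = (low + value) // 2
        if mid == low:  # no midpoint left between low and value
            break
        candidates.append(mid)
        low = mid
    return candidates


# first candidates for 1234: 617, 925, 1079, ... (rounding differs
# slightly from the Eris sequence quoted above)
print(shrink_candidates(1234))
```

Each candidate that still fails the property lets the search discard the whole range above it, which is where the logarithmic speed-up comes from.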
<br />
This release also features support for the latest versions of basic dependencies:<br />
<ul>
<li><b>PHPUnit 6.x</b> is now supported</li>
<li><b>PHP 7.1</b> is officially supported (I expect there were mostly no issues in previous releases, but now the test suite fully passes.)</li>
</ul>
Several small bugs were fixed as part of feedback from projects using Eris:<br />
<ul>
<li>the pos() and neg() generators should not shrink to 0.</li>
<li>float generation should never divide by 0.</li>
<li>shrinking of dates fell into a case of wrong operator precedence.</li>
<li>reproducible PHPUnit commands were not escaped correctly in presence of namespaced classes.</li>
</ul>
A few backward compatibility fixes were necessary to make room for new features:<br />
<ul>
<li><i>minimumEvaluationRatio</i> is now a method to be called, not a private field.</li>
<li><i>GeneratedValue</i> is now an interface and not a class. This is supposed to be an internal value: project code should never depend on it and it should build custom generators with <a href="http://eris.readthedocs.io/en/latest/generators/composite.html#map">map() and other composite generators</a> rather than implementing the <i>Generator</i> interface, which is much more complex.</li>
<li>the <i>Listener::endPropertyVerification()</i> method now takes the additional parameters <i>$iterations</i> and the optional <i>$exception</i>. When creating listeners, you should always subclass <a href="https://github.com/giorgiosironi/eris/blob/master/src/Listener/EmptyListener.php">EmptyListener</a> so that you don't have to implement the methods you are not interested in, which will be inherited as no-ops.</li>
</ul>
<h3>
What's next?</h3>
My Trello board says:<br />
<ul>
<li>still <b>decoupling from PHPUnit</b>, for usage in scripts, mainly as a programmable source of randomness.</li>
<li><b>more advanced Generators</b> for finite state machines and in general a more <b>stateful</b> approach, for testing stateful systems.</li>
<li><b>faster feedback for developers</b>, like having the option to run fewer test cases in a development environment but the full set in Continuous Integration.</li>
</ul>
I'm considering opening up the Trello board for public read-only visibility, as there's nothing sensitive in there, but potential value in transparency and feedback from random people encountering the project for the first time.<br />
<br />
As always, if you feel there is a glaring feature missing in Eris, feel free to request it on the <a href="https://github.com/giorgiosironi/eris/issues">project's GitHub issues</a>.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-54908042100237580902017-02-12T17:36:00.001+01:002017-02-12T17:36:32.318+01:00Book review: Fifty quick ideas to improve your tests<div class="separator tr_bq" style="clear: both; text-align: left;">
<a href="https://leanpub.com/50quickideas-tests" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="https://leanpub.com/50quickideas-tests" border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKShBl9k5VTHQxKnCt6cT5Ymu0qg4Yy2Dmwtczrs1JUDoGYUNeRCUgdIeaIseXa3HVBPAXqJd13Gv5p-qSUiHsThGJ7bGxLdYRljaNMAszEDnvzHxgwLXkO0-O_QTStMZnB6C6Jw/s320/hero.jpeg" width="219" /></a></div>
<div class="tr_bq">
<a href="https://leanpub.com/50quickideas-tests">Fifty quick ideas to improve your tests</a> is, well, a series of fifty quick ideas that you can implement on some of your automated test suites to improve their value or lower their creation or maintenance costs.</div>
<br />
These ideas are pattern-like in that they are mostly self-contained and often independent from each other. They are distilled from real-world scenarios that the authors (David Evans, Tom Roden and Gojko Adzic) have encountered in their work.<br />
<br />
This format helps readability a lot, as ideas are organized into themes, giving you the ability to focus on the area you want to improve and to quickly skip the ideas that do not make sense in your context, or that you find impractical or not worth the effort. For the same reasons I enjoyed the book this one is a sequel to, <a href="https://gist.github.com/giorgiosironi/84fe5d62acb7c0740891">Fifty quick ideas to improve your user stories</a>. Moreover, both were published on Leanpub, so you have the ability to adjust the price to the value you think you'll get out of them; and despite Leanpub's large collection of unfinished books, this one is 100% complete and ready to read without the hassle of having to update to a new version later (who really does that?)<br />
<br />
Some selected quotes follow, highlighted phrases mine.<br />
<blockquote>
Something one person considers critical might not even register on the scale of importance for someone from a different group.
</blockquote>
<blockquote>
Drawing parallels between the different <a href="https://en.wikipedia.org/wiki/Maslow's_hierarchy_of_needs">levels of needs</a>, we can create <b>a pyramid of software quality levels</b>: Does it work at all? What are the key features, key technical qualities? Does it work well? What are the key performance, security, scalability aspects? Is it usable? What are the key usability scenarios? Is it useful? What production metrics will show that it is used in real work? Is it successful?
</blockquote>
<blockquote>
In order to paint the big picture quickly, we often kick things off with a ten-minute session on identifying things that should always happen or that should never be allowed. This helps to set the stage for more interesting questions quickly, because absolute statements such as ‘should always’ and ‘should never’ urge people to come up with exceptions.
</blockquote>
<blockquote class="tr_bq">
Finally, <b>when an aspect of quality is quantified</b>, teams can better evaluate the cost and difficulty of measuring. For example, we quantified a key usability scenario for MindMup as ‘Novice users will be able to create and share simple mind maps in under five minutes’. Once the definition was that clear, it turned out not to be so impossible or expensive to measure it.</blockquote>
<blockquote>
Avoid checklists that are used to tick items off as people work (Gawande calls those Read-Do lists). Instead, aim to create lists that allow people to work, then pause and review to see if they missed anything (‘Do-Confirm’ in Gawande’s terminology).
</blockquote>
<blockquote class="tr_bq">
A major problem causing overly complex examples is the <b>misunderstanding that testing can somehow be completely replaced by a set of carefully chosen examples</b>. For most situations we’ve seen, this is a false premise. Checking examples can be a good start, but there are still plenty of other types of tests that are useful to do. Don’t aim to fully replace testing with examples in user stories – aim to create a good shared understanding, and give people the context to do a good job.
</blockquote>
<blockquote class="tr_bq">
Waiting for an event instead of waiting for a period of time is the preferred way of testing asynchronous systems
</blockquote>
<blockquote class="tr_bq">
The sequence is important: ‘Given’ comes before ‘When’, and ‘When’ comes before ‘Then’. Those clauses should not be mixed. All parameters should be specified with ‘Given’ clauses, the action under test should be specified with the ‘When’ clause, and all expected outcomes should be listed with ‘Then’ clauses. Each scenario should ideally have only one ‘When’ clause that clearly points to the purpose of the test. </blockquote>
<blockquote class="tr_bq">
<b>Difficult testing is a symptom, not a problem</b>. When it is difficult for a team to know if they have a complete picture during testing, then it will also be difficult for it to know if they have a complete picture during development, or during a discussion on requirements. It’s unfortunate that this complexity sometimes clearly shows for the first time during testing, but the cause of the problem is somewhere else.
</blockquote>
<blockquote class="tr_bq">
Although it’s intuitive to think about writing documents from top to bottom, with tests it is actually better to start from the bottom. Write the outputs, the assertions and the checks first. Then try to explain how to get to those outputs. [...] Starting from the outputs makes it highly unlikely that a test will try to check many different things at once,
</blockquote>
<blockquote class="tr_bq">
Technical testing normally requires the use of technical concepts, such as nested structures, recursive pointers and unique identifiers. Such things can be easily described in programming languages, but are not easy to put into the kind of form that non-technical testing tools require.
</blockquote>
<blockquote class="tr_bq">
For each test, <b>ask who needs to resolve a potential failure in the future</b>. A failing test might signal a bug (test is right, implementation is wrong), or it might be an unforeseen impact (implementation is right, test is no longer right). If all the people who need to make the decision work with programming language tools, the test goes in the technical group. If it would not be a technical but a business domain decision, it goes into the business group.
</blockquote>
<blockquote class="tr_bq">
Manual tests suffer from the problem of capacity. Compared to a machine, a person can do very little in the same amount of time. This is why manual tests tend to optimise human time [...] Since automated tests are designed for unattended execution, it’s critically important that failures can be investigated quickly. [...] To save time in execution, it’s common for a single manual test to check lots of different things or to address several risks. </blockquote>
<blockquote class="tr_bq">
Whenever a test needs to access external resources, in particular if they are created asynchronously or transferred across networks, ensure that the resources are hidden until they are fully complete.
</blockquote>
<blockquote class="tr_bq">
Time-based waiting <i>[sleep() instead of polling or waiting for an event in tests]</i> is the equivalent of going out on the street and <b>waiting for half an hour in the rain</b> upon receiving a thirty-minute delivery estimate for pizza, only to discover that the pizza guy came along a different road and dropped it off 10 minutes ago.
</blockquote>
<blockquote class="tr_bq">
Instead of just accepting that something is difficult to test and ignoring it, investigate whether you can measure it in the production environment.
</blockquote>
<blockquote class="tr_bq">
Test coverage is a negative metric: it measures how bad something is, not how good it is.
</blockquote>
<blockquote class="tr_bq">
There are just two moments when an automated test provides useful information: the first time it passes and when it subsequently fails.
</blockquote>
<blockquote class="tr_bq">
It’s far better to <b>optimise tests for reading than for writing</b>. Spending half an hour more writing a test will save days of investigation later on. [...] In business-oriented tests, if you need to compromise either ease of maintenance or readability, keep readability.
</blockquote>
Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-81495005549314032642016-12-29T17:44:00.001+01:002016-12-29T17:45:48.987+01:00Book review: The Power of Habit<a href="http://charlesduhigg.com/the-power-of-habit/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="http://charlesduhigg.com/the-power-of-habit/" border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwQxU7gFEc00-iZH0RKGBcvczwQ6dPeGG8hrPwhiZF48mru0jbNDLvPlIddDm0xKS-1_wgqKU-NNEFrmmG34SHCRphwYRUEXxAU72Nbkkjr4wXMwftL-IjG3_bKMxQzznSl8Rq2w/s320/Power-of-Habit-TP_nospine1.jpg" width="249" /></a>Charles Duhigg, a New York Times reporter, <a href="http://charlesduhigg.com/the-power-of-habit/">collects stories of building and breaking habits</a>, supporting the thesis that habits form an important part of our lives and that they can make a big difference for better or worse. Both in the case of positive training or learning habits, or in the case of addictions, repeated behavior influences our energy levels, our free time and in the end many of our long-term results at work and in life (we will all go the gym in the new year, right?)<br />
<br />
The <b>storytelling</b> style of the book may smell like <b>anecdotal evidence</b>, but it keeps the reader intrigued and entertained long enough to get its message across, without delving into fictional stories. Take the science expressed here with a grain of salt (like you would with Malcolm Gladwell): all experiments are real but they may have been cherry-picked to prove a point.<br />
<br />
The key takeaway for me was to think about our habits and try to influence them, stopping or reinforcing them depending on our <b>second-order desires</b>; for example with the <b>cue-routine-reward</b> framework proposed in the book, but ultimately with whatever works for you, as habit building and destroying must be very context-specific. Another concept that we find reasonable is <a href="https://en.wikipedia.org/wiki/Ego_depletion">ego depletion</a> (willpower as a finite resource that must be renewed), but the jury is still out on whether it is a confirmed and sizable effect, as meta-analyses of hundreds of studies do not yet agree (for good reasons).<br />
<br />
From exercising to learning, or from quitting smoking to a Facebook addiction, this self-reflection can have a large impact on our lives. Maybe it should be a habit?<br />
<br />
Selected quotes from the book follow:<br />
<blockquote>
This process within our brains is a three-step loop. First, there is a cue, a trigger that tells your brain to go into automatic mode and which habit to use. Then there is the routine, which can be physical or mental or emotional. Finally, there is a reward, which helps your brain figure out if this particular loop is worth remembering for the future [...] Every McDonald’s, for instance, looks the same—the company deliberately tries to standardize stores’ architecture and what employees say to customers, so everything is a consistent cue to trigger eating routines.</blockquote>
<blockquote>
“Even if you give people better habits, it doesn’t repair why they started drinking in the first place. Eventually they’ll have a bad day, and no new routine is going to make everything seem okay. What can make a difference is believing that they can cope with that stress without alcohol.”</blockquote>
<blockquote>
Where should a would-be habit master start? Understanding keystone habits holds the answer to that question: The habits that matter most are the ones that, when they start to shift, dislodge and remake other patterns.</blockquote>
<blockquote>
“Small wins are a steady application of a small advantage,” one Cornell professor wrote in 1984. “Once a small win has been accomplished, forces are set in motion that favor another small win.” Small wins fuel transformative changes by leveraging tiny advantages into patterns that convince people that bigger achievements are within reach.</blockquote>
<blockquote>
“Sometimes it looks like people with great self-control aren’t working hard—but that’s because they’ve made it automatic”</blockquote>
<blockquote>
“By making people use a little bit of their willpower to ignore cookies, we had put them into a state where they were willing to quit much faster,” Muraven told me. “There’s been more than two hundred studies on this idea since then, and they’ve all found the same thing. Willpower isn’t just a skill. It’s a muscle, like the muscles in your arms or legs, and it gets tired as it works harder, so there’s less power left over for other things.”</blockquote>
<blockquote>
As people strengthened their willpower muscles in one part of their lives—in the gym, or a money management program—that strength spilled over into what they ate or how hard they worked. Once willpower became stronger, it touched everything.</blockquote>
Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-76516417331873938952016-11-14T00:12:00.000+01:002016-11-14T00:12:11.258+01:00How to run Continuous Integration on EC2 without breaking the bankI live in a world increasingly made up of services (sometimes micro-), that collaborate to produce some visible behavior. This decomposition of software into (hopefully) loosely coupled components has implications for production infrastructure concern such as how many servers or containers you need to run. <br />
Consequently, it impacts also testing infrastructure as it tries to be an environment as close as possible to production, in order to reliably discover bugs and replicate them in a controlled environment without impacting real users.<br />
In particular I like to have a <b>ci</b> testing environment where each service can be tested in its own virtual machine (or container, for some), and where each node is totally isolated from the other services. In addition to that, I also like to have an <b>end-to-end</b> testing environment where services talk to each other as they would in production, and where we can run long and complex acceptance tests. <br />
This end-to-end environment usually is a perfect copy of production with respect to the technologies being used (e.g. load balancers like HAProxy or AWS ELB are in place in the same way as in production, even if there is no test that directly targets their existence); the number of nodes per service is however reduced from N to 2, as in computing there are only the 0, 1 and N equivalence categories.<br />
<br />
In the likely case that you're using cloud computing infrastructure to manage this 2x or 3x volume of servers with respect to the production infrastructure, your costs are also by default going to double or triple. One option to optimize this is to throw everything away and start deploying containers, as they could share the same underlying virtual machines as production while preserving isolation and reproducibility. On an existing architecture made up of AWS EC2 nodes, however, optimization can take us far without requiring us to rewrite all the DevOps(TM) work of the last two years.<br />
<h3>
Phase 1: expand</h3>
As I've been explaining, EC2 instances replicating production environments can expand until they bring the total number of EC2 nodes to three times the original number. Some projects just have a single EC2 node, while others have multiple nodes that have to be at least 2 in the latest testing environment before production. Moreover, the time the tests take to run on these instances is inversely correlated with how powerful the instances are in CPU and I/O terms, so you pay good money for every speed improvement you want to get on those 20- or 60-minute suites.<br />
In my current role at <a href="https://elifesciences.org/">eLife</a>, we initially got to more than 20 EC2 instances for the testing environments. This was beneficial from a quality and correctness point of view, as we could then run tests on all mainline branches before they go to production, but also on pull requests, giving timely feedback on the proposed changes without requiring developers to run everything on their machines (that should be an option, not an imperative.)<br />
<h3>
Phase 2: optimize</h3>
The AWS EC2 pricing model works by putting nodes into a <i>running</i> state when you launch them, and by pre-billing one hour of usage every 60 minutes. Therefore, booting existing instances or creating new ones from scratch is going to incur at least a 1-hour cost for each of these events:<br />
<ul>
<li>at boot, 1 hour is billed</li>
<li>at 60:00 from boot a new hour is billed,</li>
<li>at 120:00 from boot a new hour is billed, and so on</li>
</ul>
All the EC2 nodes that we have, however, use EBS disks for their root volumes. EBS is the remote block storage provided by AWS, and while some generations of instances use local instance storage for their / partition, EBS makes a lot of sense for that partition as it gives you the ability to <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html">start and stop</a> instances without losing state in between; essentially, it gives you the ability to shut down and reboot instances when you need to, without paying EC2 bills for the time in which they are stopped. The only billing being performed is for the EBS storage space, which means AWS has some hard disks in its data centers that have to preserve your instances' files, but it allocates new CPU and RAM resources for a virtual machine only when you start the EC2 instance. Therefore, an EBS-backed EC2 instance on your AWS console does not always correspond to a physical place: it is really virtual, as it can be stopped and started many times, moving between racks in the same availability zone while keeping the same data (persisted to multiple disks in other racks) and even the same private IP address.<br />
Since the process of allocating a new virtual machine and reconfiguring networks to connect everything together has some overhead, this is reflected in the boot time necessary to start an EC2 instance, which can add several seconds to the standard boot time of the operating system. It is also reflected in the pricing model, which makes it a <b>bad idea</b> to launch a new EC2 instance for every test suite you need to run: as long as your test suite takes less than 1 hour, you are paying for a full hour of resources that you then throw away. Running 6 builds in an hour on freshly launched instances would make you pay for 6 hours, which is not what you want.<br />
<h3>
Phase 2.1: stop them manually</h3>
A first optimization that can be performed manually is to stop and start these instances from the console. You would usually stop them at some point in the evening and then start them again first thing in the morning.<br />
Of course, there is good potential for automating this, as AWS provides many ways to access EC2 with its API and the SDKs that build on top of it, available for many different programming languages. You can easily build some commands that query the EC2 API looking for an instance based on its tags, and then issue commands for starting and stopping it. In general, this is almost transparent to the CloudFormation templates that you are surely using for launching these instances.<br />
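As a sketch of what such a command can look like, here is a minimal version using the boto3 Python SDK; the helper names and the Name-tag convention are my own illustration, not the actual builder code:

```python
def tag_filter(name, value):
    """Build an EC2 API filter that matches instances by a tag."""
    return [{"Name": "tag:" + name, "Values": [value]}]

def set_instance_state(tag_value, action):
    """Start or stop the instances whose Name tag matches tag_value.
    boto3 is imported lazily, so AWS credentials are only needed
    when this is actually called."""
    import boto3
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=tag_filter("Name", tag_value))["Reservations"]
    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if action == "start":
        ec2.start_instances(InstanceIds=ids)
    else:
        ec2.stop_instances(InstanceIds=ids)
```

Because CloudFormation describes the resources rather than their running state, stopping and starting instances this way leaves the templates untouched.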
The first time you start and stop an instance, there are a few problems that may come up.<br />
The first problem is that of <b>ephemeral storage</b>: as I wrote before, you have to make sure the root volume of the instance and any data you want to persist are EBS-backed and not local instance storage.<br />
The second problem is that of <b>public IP addresses</b>. While private IP addresses inside a VPC stay the same across stop and start commands, public IP addresses are a scarce resource and are only allocated to an instance while it is running. Therefore, if you had a DNS entry pointing to it, it has to be updated after the boot, whether it was manually created or part of the CloudFormation template. Default DNS entries have the form <i>ec2-public-ip-address.compute-1.amazonaws.com</i>, which depends on the public IP address and hence does not provide a good indirection.<br />
The third problem is that of <b>long-running processes</b> managed by SysV/Upstart/Systemd: the daemons of servers like Apache, Nginx or MySQL are usually configured to restart upon boot, but if you have written your own daemons or Python/PHP long-running processes and are starting them through /etc/init or /etc/init.d configuration, it pays to check that everything is in its place again after boot.<br />
The last problem I have found at this level (manual restarts) is about <b>files in /run and /var/run</b>, which are temporary directories used by daemons to place locks and other transient files, like a pidfile indicating that an instance of a program is running. If you have folders in /run or /var/run, those folders will have to be recreated after each boot. Systemd provides the <i>tmpfiles.d</i> mechanism, which automatically creates a hierarchy of files, but it's usually easier (and more portable) to have the daemons create their own folders (php-fpm does that) or, if they are not able to, to place their files not in /var/run/some_folder_that_will_stop_existing but directly in /var/run or even /tmp, without subfolders.<br />
<h3>
Phase 2.2: start them on demand</h3>
Instead of manually starting EC2 instances, or automating their stopping and starting as a periodical task, you can also start them on demand as needed by the various builds that have to run. So whenever project <i>x</i> needs to build a new commit on master or a pull request, you will start the <i>x--ci</i> EC2 instance.<br />
In this case, however, there is a larger potential for race conditions as you may try to run a deploy or any command on an instance before it's actually ready to be used. Therefore, we wrote <a href="https://github.com/elifesciences/builder/blob/master/src/buildercore/lifecycle.py">some automation code</a> that waits for several events before letting a build proceed:<br />
<ol>
<li>the instance must have gone from the <i>pending</i> to the <i>running</i> state on the EC2 API. This hopefully means AWS has found a CPU and other resources to assign to it.</li>
<li>the instance must be accessible through <b>SSH</b>.</li>
<li>through SSH, we monitor that the file <b>/var/lib/cloud/instance/boot-finished</b> has appeared. This file is created at each boot, once all daemons have been started, as part of the standard cloud-init package.</li>
</ol>
Once the instance has gone through all these phases, you can SSH into it and run whatever you want.<br />
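The waiting logic can be sketched as a generic polling loop; the three callables below are hypothetical stand-ins for the real checks (the EC2 API state, an SSH probe, and a test for /var/lib/cloud/instance/boot-finished):

```python
import time

def wait_for(check, timeout=300, interval=5):
    """Poll a zero-argument callable until it returns True,
    or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

def wait_until_usable(is_running, ssh_reachable, boot_finished, timeout=300):
    """Run the three readiness checks in order: only when all of them
    pass is the instance safe to use for a build."""
    for check in (is_running, ssh_reachable, boot_finished):
        if not wait_for(check, timeout=timeout):
            return False
    return True
```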
<br />
<h3>
Phase 2.3: stop them when it's more efficient</h3>
Once you have transitioned to starting instances at the last responsible moment, you can do the same for stopping them, instead of just waiting for the end of the day to shut everything down.<br />
We now have a periodical job, running every 2 minutes, that takes a list of servers to stop. In parallel, it performs <a href="https://github.com/elifesciences/builder/blob/master/src/buildercore/lifecycle.py#L55">the following routine</a> for each of the EC2 instances:<br />
<ul>
<li>checks if the server has been running for an amount of time between <i>h</i>:55:00 and <i>h</i>:59:59, where <i>h</i> is some whole number of hours;</li>
<li>if the condition is true, stops the instance before a new hour is billed;</li>
<li>otherwise, leaves the instance running: you have already paid for this hour, so it does no harm, and the instance can be used to run new builds at no extra cost.</li>
</ul>
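The stopping condition boils down to checking where the uptime falls within the current billing hour; here is a sketch of that check (the function name is my own, not the actual builder code):

```python
def should_stop(uptime_seconds):
    """True when the instance is in the last five minutes of an
    already-billed hour (between h:55:00 and h:59:59 for any h),
    so that stopping it now avoids being billed for a new hour."""
    return uptime_seconds % 3600 >= 55 * 60

print(should_stop(50 * 60))           # False: 10 paid minutes still remain
print(should_stop(58 * 60))           # True: the next hour is about to be billed
print(should_stop(3600 + 57 * 60))    # True in any later hour as well
```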
Therefore, when developers open a dozen pull requests on the same project, only the first starts the necessary instances to run the tests; the other ones are queued behind that and will get access to the same instance, one at a time.<br />
<h3>
Bonus: Jenkins locks</h3>
Starting and stopping instances periodically would be dangerous if there weren't a mechanism for mutual exclusion between builds and lifecycle operations like starting and stopping. Not only do you not want to run builds for the same project on the same instance if they interfere with each other, but you definitely don't want an instance to be shut down while a build is still running.<br />
Therefore, we wrap both these lifecycle operations and the builds in per-resource locks, using the <a href="https://wiki.jenkins-ci.org/display/JENKINS/Lockable+Resources+Plugin">Jenkins Lockable Resources</a> plugin. If the periodical stopping task tries to stop an instance where a build is running, it will have to wait to acquire the <a href="https://github.com/elifesciences/elife-jenkins-workflow-libs/blob/master/vars/builderStopAll.groovy#L6">lock</a>. This ensures that machines that see many builds do not get stopped easily, while idle ones will be stopped at the end of their already paid hour.<br />
<h3>
Conclusions</h3>
Cloud computing is meant to improve the efficiency with which we allocate resources from someone else's data centers: you pay for what you use. Therefore, with a little persistence provided by EBS volumes, you can pay only for the hours that your builds require, and not for keeping idle EC2 instances running every day of the year. Of course, you'll need some tweaking of your automation tools to easily start and stop instances; but it is a surefire investment that can usually save more than half of the cost of your testing infrastructure, by putting it at rest during weekends and non-working hours.Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-35701350340482853692016-10-02T19:20:00.000+02:002016-10-02T19:23:39.846+02:00Deep learning: an introduction for the layperson<div class="separator" style="clear: both; text-align: center;">
<a href="https://giorgiosironi.github.io/talks/deep_learning/#/"><img alt="https://giorgiosironi.github.io/talks/deep_learning/#/" border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizR-7YZtCpjG_THrB3wU1DSn8FRi1tmtZLFfe3x7Sz2xWiwXw78eOO8wwLrNYyqHcQazQMSC-Te8qJ3vU8AJvDJiOgBOc384SRLsFhhzOBa1vIYC8iYheuf7tiS6YcJ0vzwXbRmg/s320/alphago.jpg" width="320" /></a></div>
Deep learning is one of the buzzwords of 2016, promising to revolutionize the world possibly without <a href="https://en.wikipedia.org/wiki/Skynet_(Terminator)">Skynet</a> gaining self-awareness.<br />
<br />
Last week I held an internal talk at <a href="https://elifesciences.org/">eLife Sciences</a> to introduce colleagues from both the software and scientific backgrounds to the concept.<br />
<br />
Here are the <a href="https://giorgiosironi.github.io/talks/deep_learning/#/">slides</a>, complete with notes (accessible through a popup by pressing S) that explain what the diagrams and other figures mean.<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-56667371206466149902016-05-22T21:35:00.002+02:002017-03-12T20:15:09.868+01:00Eris 0.8.0 is out<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMTVULNyCwTQ4-c-IvF8o4DHFhyphenhyphenHMt3JEOVua1CO6R3gRDmPS8sirOn0zG6_GpfAxlIjnGTS6lSi_7F-Ah16CCD6RJbVZTXXxnMlt-WSGEAHzbzu_J7jk5eDzC9KjDN3yUlVOSrg/s1600/Artist%2527s_impression_dwarf_planet_Eris.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMTVULNyCwTQ4-c-IvF8o4DHFhyphenhyphenHMt3JEOVua1CO6R3gRDmPS8sirOn0zG6_GpfAxlIjnGTS6lSi_7F-Ah16CCD6RJbVZTXXxnMlt-WSGEAHzbzu_J7jk5eDzC9KjDN3yUlVOSrg/s320/Artist%2527s_impression_dwarf_planet_Eris.jpg" width="320" /></a></div>
In the period before my move to Cambridge I got some time to work on <a href="https://packagist.org/packages/giorgiosironi/eris">Eris</a>, and to use it to <a href="http://www.giorgiosironi.com/2016/03/on-property-based-testing-highly.html">test the guts of Onebip's infrastructure</a>. Lots of new features are now incorporated in the 0.8.0 version, along with a modernization of PHP standards compliance carried out by <a href="https://twitter.com/localheinz">@localheinz</a>.<br />
<h3>
What's new?</h3>
Here's the most important news, a selection from the <a href="https://github.com/giorgiosironi/eris/blob/master/ChangeLog.md">ChangeLog</a>:<br />
<ul>
<li>The <i>bind</i> Generator lets you use the random output of a Generator to build another Generator.</li>
<li>Optional logging of generations with <i>hook(Listener\log($filename))</i>.</li>
<li><i>disableShrinking()</i> option.</li>
<li><i>limitTo()</i> accepts a DateInterval to stop tests at a predefined maximum time.</li>
<li>Configurability of randomness: choice between rand, mt_rand, and a pure PHP Mersenne Twister.</li>
<li>The <i>suchThat</i> Generator accepts PHPUnit constraints, like <i>when()</i> does.</li>
</ul>
Some bugs and annoyances were fixed: <br />
<ul>
<li>No warnings on PHP 7 anymore.</li>
<li>Fixed bug of size not being fully explored due to slow growth.</li>
<li>Switched to PSR-2 coding standards and PSR-4 autoloading.</li>
</ul>
And there were some backward compatibility breaks (we are in 0.x after all):<br />
<ul>
<li>The <i>frequency</i> Generator only accepts variadic args, not an array anymore.</li>
<li>Removed the <i>strictlyPos</i> and <i>strictlyNeg</i> Generators as duplicates of the <i>pos</i> and <i>neg</i> ones.</li>
<li>Removed the <i>andAlso</i>, <i>theCondition</i>, <i>andTheCondition</i>, <i>implies</i> and <i>imply</i> aliases, which expanded the surface area of the API for no good reason. Added <i>and</i> for multiple preconditions.</li>
</ul>
Eris is now quite extensible with custom Generators for new types of data; custom Listeners to know what's going on; and even different sources of randomness to tune repeatability and performance.<br />
I believe what's very important about this release is the <b>release of technical <a href="http://eris.readthedocs.io/en/latest/">documentation</a></b>. This is not a list of APIs generated by parsing the code, but a full manual of Eris features, which will be kept religiously up-to-date <a href="https://github.com/giorgiosironi/eris/tree/master/docs">in the repository itself</a> and rebuilt automatically at each commit.<br />
<h3>
What's next?</h3>
My Trello board says:<br />
<ul>
<li><b>decoupling</b> from PHPUnit: it should be possible to run Eris also with PHPSpec (already possible, but not as robustly as it could be) or in standalone scripts.</li>
<li>Multiple possibilities for <b>shrinking</b>, borrowing from test.check rose trees. This feature may speed up the shrinking process and make it totally deterministic.</li>
<li>A few more advanced <b>Generators</b>: for example testing Finite State Machines.</li>
</ul>
If you are using Eris and wanna give feedback, feel free to <a href="https://github.com/giorgiosironi/eris/issues">open a Github issue</a> to discuss. Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com0tag:blogger.com,1999:blog-36547168.post-27919428857711979952016-04-19T12:03:00.003+02:002016-04-19T12:04:05.980+02:00Next stop: Cambridge<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijr1lSmXvNpNUxLN3uCpk-TkqaEYiBTkZ3W7xRwAmCVgiJ-hP7PUwxgWUkCRAp7j1nGGJi3GrhwhUpTVPlMMRzWwFSd7RaxEHZq5bQKABKtqb5P5O0baRX2hD5ipCVuPvbBCTlIw/s1600/Onebiplogo.gif" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="53" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijr1lSmXvNpNUxLN3uCpk-TkqaEYiBTkZ3W7xRwAmCVgiJ-hP7PUwxgWUkCRAp7j1nGGJi3GrhwhUpTVPlMMRzWwFSd7RaxEHZq5bQKABKtqb5P5O0baRX2hD5ipCVuPvbBCTlIw/s200/Onebiplogo.gif" width="200" /></a></div>
Last Friday was my last working day at <b><a href="http://corporate.onebip.com/">Onebip</a></b>, the carrier billing payment platform headquartered in Milan. I leave the best technical team I have ever worked with, which has tackled endless challenges, from transitioning to a microservice architecture, to adopting CQRS and Event Sourcing, to testing a large product that depends on integration with 400 mobile carriers.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrtA6kifUDYg5qZtJEoD5IeSt8X4PS5YxU_zskzkhQ7PTfrhF6-GfmhKy9CyZKUhAvw4qr4ePO2Cs2llwKQtmWsSlXH7KFvqiMmRWDm2qvrbXXfkCoySKX7wrQRAfYKXLljbz6Pg/s1600/elife-full-color-horizontal.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="101" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrtA6kifUDYg5qZtJEoD5IeSt8X4PS5YxU_zskzkhQ7PTfrhF6-GfmhKy9CyZKUhAvw4qr4ePO2Cs2llwKQtmWsSlXH7KFvqiMmRWDm2qvrbXXfkCoySKX7wrQRAfYKXLljbz6Pg/s200/elife-full-color-horizontal.png" width="200" /></a></div>
In May, I will start a new adventure as a Software Engineer in Test at <a href="http://elifesciences.org/">eLife</a>. Located in Cambridge, <b>eLife</b> is an open access journal that publishes scientific articles in the fields of biology and medicine, with the goal of improving the peer review process and accelerating science. As a non-profit organization, it's quite a different context with respect to selling products and services, but one with a potentially large and positive impact on the world.<br />
<br />
Cambridge is a city of research and technology, and welcomes students and scientists, but also software developers like me. Moreover, it's small and peaceful (you can cycle anywhere), while showing peaks of high technical level. It's the first place where I have been to a <a href="http://www.meetup.com/Cambridge-Programmers-Study-Group/">study group on the book Structure and Interpretation of Computer Programs</a>, or to quite a good <a href="https://blog.cambridgecoding.com/2016/01/18/free-evening-tech-talk-an-introduction-to-machine-learning/">introduction to machine learning</a> talk (and not to <i>Transpile typed ECMAScript without left-pad nor using arrays because you would need a polyfill for that</i> or some other hipster hallucination).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8juyHIi4efqnSsYXOJVU2nWPZTXOPEqWc_Lv7oo5BF0WxFjk6eyGnSm3IuaCIBlMUUQknwtX3gfwBAV1dsqIE-F3lynvhSXN2Aa2ldnkxXtkqa0qw-A6Zu2nc_qBj5mT6c5DZpg/s1600/2016-01-24+09.13.22.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8juyHIi4efqnSsYXOJVU2nWPZTXOPEqWc_Lv7oo5BF0WxFjk6eyGnSm3IuaCIBlMUUQknwtX3gfwBAV1dsqIE-F3lynvhSXN2Aa2ldnkxXtkqa0qw-A6Zu2nc_qBj5mT6c5DZpg/s640/2016-01-24+09.13.22.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">See you on the other side of the Channel...</td></tr>
</tbody></table>
<br />Giorgiohttp://www.blogger.com/profile/12689416577856305650noreply@blogger.com2