Invisible to the eye: January 2019

Thursday, January 10, 2019

Practical Helm in 5 minutes

Yet another ship-themed name

Containerization is an increasingly powerful way to deploy applications on anonymous infrastructure, such as a set of many identical virtual machines run by some cloud provider. Since container images ship a full OS, there is no need to manage packages on the servers (such as a PHP or Python interpreter), but there are still environment-specific choices that need to be provided to actually run the application: configuration files and environment variables, ports, hostnames, secrets.

In an environment like Kubernetes, you would create all of this declaratively, writing YAML files describing each Pod, ConfigMap, Service and so on. Kubernetes will take these declarations and apply them to its state to reach what is desired.

As soon as you move outside of a demo towards multiple environments, or towards updating one, you will start to see Kubernetes YAML resources not directly as code to be committed into a repository, but as an output of a generation process. There are many tweaks and customizations that need to be performed in each environment, from simple hostnames (staging--app.example.com vs app.example.com) to entire sections being present or not (persistence and replication of application instances).

The problem you need to solve then is to generate Kubernetes resources from some sort of templates: you could choose any template engine for this task, and execute kubectl apply on the result. To avoid reinventing the wheel, Helm and other competitors were created to provide a higher abstraction layer.
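
To make the "any template engine" option concrete, here is a minimal sketch using Python's standard string.Template. The ConfigMap contents, the placeholder names and the replica counts are assumptions made up for illustration; only the two hostnames come from the example above.

```python
from string import Template

# Hypothetical manifest template; the placeholder names are illustrative.
CONFIGMAP_TEMPLATE = Template("""\
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${app}-config
data:
  APP_HOSTNAME: ${hostname}
  APP_REPLICAS: "${replicas}"
""")

# Per-environment values; the replica counts are assumed for the example.
ENVIRONMENTS = {
    "staging": {"hostname": "staging--app.example.com", "replicas": 1},
    "prod": {"hostname": "app.example.com", "replicas": 3},
}


def render(environment: str, app: str = "green-widgets") -> str:
    """Return the manifest text for one environment."""
    values = ENVIRONMENTS[environment]
    return CONFIGMAP_TEMPLATE.substitute(app=app, **values)


if __name__ == "__main__":
    # The printed manifest could be piped to `kubectl apply -f -`.
    print(render("staging"))
```

This works for a single resource, but it leaves out everything around the rendering step: tracking what was applied, upgrading it and removing it.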

Enter Helm

Helm provides templating for Kubernetes .yaml files; as part of this process, it extracts the configuration values for Kubernetes resources into a single, hierarchical data source.

Helm doesn't stop there however: it aims to be a package manager for Kubernetes, hence it won't just create resources such as a Deployment, but it will also:

apply the new resources on the Kubernetes cluster
tag the Deployment with metadata and labels
list everything that is installed in terms of applications, rather than Deployments and ConfigMaps
find older versions of the Deployment to be replaced or removed

The set of templates, helpers, dependencies and default values Helm uses to deploy an application is called a chart, whereas every instance of a chart created on a cluster is called a release. Therefore, Helm keeps track of objects in terms of releases and allows you to update a release and all its contents, or to remove it and replace it with a new one.

Folder structure

The minimal structure of a Helm chart is simply a folder on your filesystem, whose name must be the name of the chart. As an example, I'll use green-widgets as a name: a fictional web application for ordering green widgets online.

This is what you'll see inside a chart:

Chart.yaml: metadata about the chart such as name, description and version.
values.yaml: configuration values that may vary across releases. At a bare minimum the image name and tag will have defaults here, along with ports to expose.
the templates/ subfolder: contains various YAML templates that will be rendered as part of the process of creating a new release. There is more in this folder like a readme for the user and some helper functions for generating common snippets.

Apart from this minimal setup, there may also be a requirements.yaml file and a charts/ subfolder to deal with other charts to use as dependencies; for example, to install a database through an official chart rather than setting up PostgreSQL replication on your own. These can be safely ignored until you need these features though.

Once you have the helm binary on your system, you can generate a new chart with helm create green-widgets.

Cheatsheet

You can download a helm binary for your platform from the project's releases page on GitHub. The helm init command will use your kubectl configuration (and authentication) to install Tiller, the server-side part of Helm, onto a cluster's system namespace.

Once this is set up, you will be able to execute helm install commands against the cluster, using charts on your local filesystem. For real applications, you can install official charts that are automatically discovered from the default Helm repositories.

The command I prefer to use to work on a chart however is:
helm upgrade --install --set key=value green-widgets--test green-widgets/

The mix of upgrade and install means this command is idempotent and will work for the first installation as well as for updates. Normally you would issue a new release for a change to the chart, but this approach allows you to test out a chart while it's in development, using a 0.0.1 version.
There is no constraint on the release name green-widgets--test, and Helm can even generate random names for you. I like to use the application name and its environment name as a team convention, but you should come up with your own design choices.
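
To illustrate what a --set key=value override does conceptually to the hierarchical values, here is a small sketch. It is not Helm's implementation: the default values are hypothetical stand-ins for a values.yaml, and every override is kept as a string for simplicity.

```python
import copy

# Hypothetical defaults, standing in for a chart's values.yaml.
DEFAULTS = {
    "image": {"repository": "example/green-widgets", "tag": "latest"},
    "service": {"port": 8080},
}


def set_value(values: dict, assignment: str) -> dict:
    """Return a copy of values with one dotted key=value override applied."""
    key, _, raw = assignment.partition("=")
    result = copy.deepcopy(values)
    node = result
    *parents, leaf = key.split(".")
    for name in parents:
        # Walk down the tree, creating intermediate mappings as needed.
        node = node.setdefault(name, {})
    node[leaf] = raw
    return result


if __name__ == "__main__":
    print(set_value(DEFAULTS, "image.tag=1.2.3"))
```

The defaults are left untouched, so each release can start from the same values and apply only its own overrides.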

A final command to keep in mind is helm delete green-widgets--test, which will delete the release and all the resources created by your templates. This is enough to stop using CPU, memory and IP addresses, but it's not enough to completely remove all knowledge of the release from Tiller's archive. To do so (and free the release name, allowing its re-creation) you should add the --purge flag.

Caveats

This 5-minute introduction makes it all seem plain and simple, but it should be clear that simply downloading Helm and installing it is not a production-ready setup. I myself have only rolled out this setup to testing environments at the time of writing.

I can certainly see several directions to explore, which I either cut from the scope in order to get these environments up and running for code review, or investigated and used but did not include in this post. For example:

requirements.yaml allows you to include other charts as dependencies. This is very powerful for off-the-shelf open source software such as databases, caches and queues; it needs careful choices for the configuration values being passed to these dependencies, and your mileage may vary with the quality of the chart you have chosen.
chart repositories are a good way to host stable chart versions rather than copying them onto a local filesystem. For example, you could push tarballs to S3 and have a plugin regenerate the index.
the whole Helm and Tiller setup arguably needs to be part of an Infrastructure as Code approach like the rest of the cluster. For example, I am creating an EKS cluster using Terraform, and that would also need to include the installation and configuration of Tiller to provide a turnkey solution for new clusters.

Wednesday, January 02, 2019

The path from custom VM to VM with containers

https://commons.wikimedia.org/wiki/File:Kanda_container.jpg

Image of a single container being transported by OiMax

Before the transition to Docker containers started at eLife, a single service deployment pipeline would pick up the source code repository and deploy it to one or more virtual machines on AWS (EC2 instances booted from a standard AMI). As the pipeline went across the environments, it repeated the same steps over and over in testing, staging and production. This is the story of the journey from a pipeline based on source code for every stage, to a pipeline deploying an immutable container image; the goal pursued here being the time savings and the reduced failure rate.

The end point is seen as an intermediate step before getting to containers deployed into an orchestrator, as our infrastructure wasn't ready to accept a Kubernetes cluster when we started the transition, nor was Kubernetes itself trusted yet for stateful, old-school workloads such as running a PHP application that writes state on the filesystem. Achieving containers-over-EC2 allows developers to target Docker as the deployment platform, without yet realizing the cost savings related to the bin packing of those containers onto anonymous VMs.

Starting state

A typical microservice for our team would consist of a Python or PHP codebase that can be deployed onto a usually tiny EC2 instance, or onto more than one if user-facing. Additional resources that are usually not really involved in the deployment process are created out of band (with Infrastructure as Code) for this service, like a relational database (outsourced to RDS), a load balancer, DNS entries and similar cloud resources.

Every environment replicates this setup, whether it is a ci environment for testing the service in isolation, or an end2end one for more large-scale testing, or even a sandbox for exploratory, manual testing. All these environments try to mimic the prod one, especially end2end which is supposed to be a perfect copy on fewer resources.

A deployment pipeline has to go through environments as a new release is promoted from ci to end2end and prod. The amount of work that has to be repeated to deploy from source on each of the instances is sizable however:

ensure the PHP/Python interpreter is correctly set up and all extensions are installed
check out the repository, which hopefully isn't too large
run scripts if some files need to be generated (from CSS to JS artifacts and anything similar)
install or update the build-time dependencies for these tasks, such as a headless browser to generate critical CSS
run database migrations, if needed
import fixture data, if needed
run or update stub services to fill in dependencies, if needed (in testing environments)
run or update real sidecar services such as a queue broker or a local database, if present

This ever-expanding sequence of operations for each stage can be optimized, but in the end the best choice is not to repeat work that only needs to be performed once per release.

There is also a concern about the end result of a deploy being different across environments. This difference could be in state, such as a JS asset served to real users being different from what you tested; but also in outcome, as a process that runs perfectly in testing may run into an APT repository outage when in production, failing your deploy halfway through, only on one of the nodes. Not repeating operations leads not just to time savings but to a simpler system in which fewer operations can fail just because there are fewer of them in general.

Setting a vision

I have previously automated builds that generate a set of artifacts from the source code repository and then deploy them across environments, for example zipping all the PHP or Python code into an archive or some other sort of package. This approach works well in general, and it is what compiled languages naturally do since they can't get away with recompiling in every environment. However, artifacts do not take into account OS-level dependencies like the Python or PHP version with their configuration, along with any other setup outside of the application folder: a tree of directories for the cache, users and groups, deb packages to install.

Container images promise to ship a full operating system directory tree, which will run in any environment only sharing a kernel with its host machine. Seeing docker build as the natural evolution of tar -cf ... | bzip2, I set out to port the build processes of the VMs into portable container images for each service. We would then still be deploying these images as the only service on top of an EC2 virtual machine, but each deployment stage should just consist of pulling one or more images and starting them with a docker-compose configuration. The stated goal was to reduce the time from commit to live, and the variety of failures that can happen along the way.

Image immutability and self-sufficiency

To really save on deployment time, the images being produced for a service must be the same across environments. There are some exceptions like a ci derivative image that adds testing tools to the base one, but all prod-like environments should get the same artifact; this is not just for reproducibility but primarily for performance.

The approach we took was to also isolate services into their own containers, for example creating two separate fpm and nginx images (wsgi and nginx for Python); or to use a standard nginx image where possible. Other specialized testing images like our own selenium extended image can still be kept separate.

The isolation of images doesn't just make them smaller than a monolith, but provides Docker-specific advantages like independent caching of their layers. If you have a monolith image and you modify your composer.json or package.json file, you're in for a large rebuild. Segregating responsibilities leads instead to only one or two of the application images being rebuilt, without ever having to reinstall the packages needed for Selenium debugging. This can also be achieved by embedding various targets (FROM ... AS ...) into a single Dockerfile, and having docker-compose build one of them at a time with the build.target option.

When everything that is common across the environments is bundled within them, what remains is configuration in the form of docker-compose.yml and other files:

which container images should be running and exposing which ports
which commands and arguments the various images should be passed when they are started
environment variables to pass to the various containers
configuration files that can be mounted as volumes

Images would typically have a default configuration file in the right place, or be able to work without one. A docker-compose configuration can then override that default with a custom configuration file, as needed.

One last responsibility of portable Docker images is the definition of a basic HEALTHCHECK. This means an image has to ship enough basic tooling to, for example, load a /ping path on its own API and verify that a 200 OK response comes back. In the case of classic containers like PHP FPM or a WSGI Python container, this implies some tooling will be embedded into the image to talk to the main process through its native protocol (FastCGI in the case of PHP FPM) rather than through HTTP.
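
For a service that does speak HTTP, the probe can be very small. The following is a sketch of a script that could ship inside a Python image; the port 8080 and the default URL are assumptions, while /ping is the path mentioned above. Docker treats exit code 0 as healthy and 1 as unhealthy.

```python
import sys
import urllib.error
import urllib.request

# Assumed default; the real port depends on the service.
DEFAULT_URL = "http://localhost:8080/ping"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the URL answers with a 200 status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts and HTTP error statuses
        # all count as unhealthy.
        return False


def main(argv: list) -> int:
    """Map the probe result to the exit code Docker expects."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    return 0 if is_healthy(url) else 1
```

A script entry point would end with sys.exit(main(sys.argv)), and the Dockerfile would reference the script in its HEALTHCHECK instruction.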

It would be a pity to reinvent the lifecycle management of the container (being started, then healthy or unhealthy after a series of probes) when we can define a simple command that both docker-compose and actual orchestrators like Kubernetes can execute to detect the readiness of the new containers after a deploy. I used to ship smoke tests with the configuration files to use, but these have largely been replaced by polling for a health status on the container itself.
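
Polling for the health status after a deploy can be sketched as follows. This is an illustration rather than the script we actually run: the function names, the number of attempts and the delay are all assumptions. It relies on docker inspect exposing the status as starting, healthy or unhealthy.

```python
import subprocess
import time
from typing import Callable


def docker_health_status(container: str) -> str:
    """Ask the Docker CLI for the health status of a container."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_until_healthy(
    container: str,
    status_of: Callable[[str], str] = docker_health_status,
    attempts: int = 30,
    delay: float = 2.0,
) -> bool:
    """Poll until the container is healthy; stop on unhealthy or timeout."""
    for _ in range(attempts):
        status = status_of(container)
        if status == "healthy":
            return True
        if status == "unhealthy":
            return False
        # "starting": the probes have not produced a verdict yet.
        time.sleep(delay)
    return False
```

The status_of parameter exists only so that the loop can be exercised without a Docker daemon; a deploy script would call wait_until_healthy with the container name and fail the stage when it returns False.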

Image size

Multi-stage builds are certainly the tool of choice to keep images small: perform expensive work in separate stages, and whenever possible only copy files into the final stage rather than executing commands that use the filesystem and bloat the image with their leftover files.

A consolidated RUN command is also a common trick to bundle together different processes like apt-get update and rm /var/lib/apt/lists/* so that no intermediate layers are produced, and temporary files can be deleted before a snapshot is taken.

To find out where this optimization is needed however, some introspection is required. You can run docker inspect over a locally built image to check its Size field and then docker history to see the various layers. Large layers are hopefully being shared between one image and the next if you are deploying to the same server. Hence, if the image is big, it pays to verify that most of its size comes from the ancestor layers and that they seldom change.

A final warning about sizes is related to images with many small files, like node_modules/ contents. These images may exhaust the inodes of the host filesystem well before they fill up the available space. This doesn't happen when deploying source code to the host directly as files can be overwritten, but every new version of a Docker image being deployed can easily result in a full copy of folders with many small files. Docker's prune commands often help by targeting various instances of containers, images and other leftovers, whereas df -i (as opposed to df -h) diagnoses inode exhaustion.
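
If you want to check inodes from a monitoring script rather than by eye, the same information behind df -i is available through statvfs. This is a minimal sketch for Unix hosts; the function name is my own and the result only approximates the percentage df prints.

```python
import os


def inode_usage(path: str = "/") -> float:
    """Return the fraction of inodes in use on the filesystem holding path."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        # Some filesystems do not report inode counts at all.
        return 0.0
    return (stats.f_files - stats.f_ffree) / stats.f_files


if __name__ == "__main__":
    print(f"{inode_usage('/'):.1%} of inodes in use")
```

Pointing it at the filesystem holding Docker's data directory (/var/lib/docker by default on Linux) tells you when it is time to prune.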

Underlying nodes

Shipping most of the stack in a Docker image makes it easier to change it as it's part of an immutable artifact that can be completely replaced rather than a stateful filesystem that needs backward compatibility and careful evolution. For example, you can just switch to a new APT repository rather than transition from one to another by removing the old one; only install new packages rather than having to remove the older ones.

The host VMs become leaner and lose responsibilities, becoming easier to test and less variable; you could almost say all they have to run is a Docker daemon and very generic system software like syslog, but nothing application-specific apart from container dependencies such as providing a folder for config files to live on. Whatever Infrastructure as Code recipes you have in place for building these VMs, they will become easier and faster to test, with the side-effect of also becoming easier to replace, scale out, or retire.

An interesting side effect is that most of the first stages of project pipelines lost the need for a specific CI instance to deploy to. In a staging environment, you actually need to replicate a configuration similar to production, like using a real database; but in the first phases, where the project is tested in isolation, the test suite can effectively run on a generic Jenkins node that works for all projects. I wouldn't run multiple builds at the same time on such a node as they may have conflicts on host ports (everyone likes to listen on localhost:8080), but as long as the project cleans up after failure with docker-compose down -v or similar, a new build of a wholly different project can be run with practically no interaction.

Transition stages

After all this care in producing good images and cleaning up the underlying nodes, we can look at the stages in which a migration can be performed.

A first rough breakdown of the complete migration of a service can be aligned on environment boundaries:

use containers to run tests in CI (xUnit tools, Cucumber, static checking)
use containers to run locally (e.g. mounting volumes for direct feedback)
roll out to one or more staging environments
roll out to production

This is the path of least resistance, and correctly pushes risk first to less important environments (testing) and only later to staging and production; hence you are free to experiment and break things without fear, acquiring knowledge of the container stack for later on. I think it runs the risk of leaving some projects halfway, where the testing stages have been ported but production and staging still run with the host-checks-out-source-code approach.

A different way to break this down is to perform the split by considering the single processes involved. For example, consider an application with a server listening on some port, a CLI and a long-running process such as a queue worker:

start building an image and pulling it on each environment, from CI to production
try running CLI commands through the image rather than the host
run the queue worker from the image rather than on the host
stop old queue worker
run the server, using a different port
switch the upper layer (nginx, a load balancer, ...) to use the new container-based server
stop old server
remove source code from the host

Each of these slices can go through all the environments as before. You will be hitting production sooner, which means Docker surprises will propagate there (it's still not as stable as Apache or nginx); but issues that can only be triggered in production will happen on a smaller part of your application, rather than as a big bang of the first production deploy of these container images.

If you are using any dummy project, stub or simulator, they are also good candidates for being switched to a container-based approach first. They usually won't get to production however, as they will only be in use in CI and perhaps some of the other testing environments.

You can also see how this piece-wise approach lets you run both versions of a component in parallel, move between one and the other via configuration and finally remove the older approach when you are confident you don't need to roll back. At the start using a Docker image doesn't seem like a huge change, but sometimes you end up with 50 modified files in your Infrastructure as Code repository, and 3-4 unexpected problems to get them through all the environments. This is essentially Branch by Abstraction applied to Infrastructure as Code: a very good idea for incremental migrations applied to an area that normally needs to move at a slower pace than application code.
