Consequently, it also impacts the testing infrastructure, which tries to be as close as possible to production in order to reliably discover bugs and replicate them in a controlled environment without impacting real users.
In particular, I like to have a CI testing environment where each service can be tested in its own virtual machine (or container, for some), and where each node is totally isolated from the other services. In addition to that, I also like to have an end-to-end testing environment where services talk to each other as they would in production, and where we can run long and complex acceptance tests.
This end-to-end environment is usually a perfect copy of production with respect to the technologies being used (e.g. load balancers like HAProxy or AWS ELB are in place, just as in production, even if no test directly targets their existence). The number of nodes per service is however reduced from N to 2, as in computing there are only 0, 1 and N equivalence categories.
In the likely case that you're using cloud computing infrastructure to manage this 2x or 3x volume of servers with respect to the production infrastructure, your costs are also by default going to double or triple. One option to optimize this is to throw away everything and start deploying containers, as they could share the same underlying virtual machines as production while preserving isolation and reproducibility. On an existing architecture made up of AWS EC2 nodes, however, optimization can take us far without having to rewrite all the DevOps(TM) work of the last two years.
Phase 1: expand
As I've been explaining, EC2 instances replicating production environments can expand until they bring the total number of EC2 nodes to three times the original number. Some projects have just a single EC2 node, while others have multiple nodes, of which there have to be at least 2 in the last testing environment before production. Moreover, the time the tests take to run on these instances is inversely correlated with how powerful they are in CPU and I/O terms, so you pay good money for every speed improvement you want to get on those 20- or 60-minute suites.
In my current role at eLife, we initially got to more than 20 EC2 instances for the testing environments. This was beneficial from a quality and correctness point of view, as we could then run tests on all mainline branches before they go to production, but also on pull requests, giving timely feedback on the proposed changes without requiring developers to run everything on their machines (that should be an option, not an imperative.)
Phase 2: optimize
The AWS EC2 pricing model works by putting nodes into a running state when you launch them, and by pre-billing one hour of usage every 60 minutes. Therefore, booting an existing instance or creating one from scratch incurs at least a 1-hour cost, with a new charge at each of these events:
- at boot, 1 hour is billed;
- at 60:00 from boot, a new hour is billed;
- at 120:00 from boot, a new hour is billed, and so on.
Since the process of allocating a new virtual machine and reconfiguring networks to connect everything together has some overhead, starting an EC2 instance takes several seconds on top of the standard boot time of the operating system. The pricing model adds to this cost, making it a bad idea to launch a new EC2 instance for every test suite you need to run: as long as your test suite takes less than 1 hour, you are still paying for a full hour of resources and throwing away the unused remainder. Running 6 builds in an hour, each on a freshly launched instance, would make you pay for 6 hours, which is not what you want.
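To make the arithmetic concrete, here is a minimal sketch of the hourly pre-billing rule as described above. The function name billed_hours and the 9-minute build duration are made up for illustration; the code only encodes the rule "one hour at boot, one more at every 60-minute mark".

```python
def billed_hours(uptime_minutes: float) -> int:
    """Hours billed for a single boot under the hourly pre-billing rule.

    One hour is billed at boot, then a new one at 60:00, 120:00, and so on.
    """
    if uptime_minutes < 0:
        raise ValueError("uptime cannot be negative")
    return int(uptime_minutes // 60) + 1


# Six hypothetical 9-minute builds, each on a freshly launched instance:
# six separate boots, so six separately billed hours.
fresh_instance_per_build = sum(billed_hours(9) for _ in range(6))

# The same six builds run back to back on one instance that stays up
# for 54 minutes in total: a single boot, still inside the first hour.
single_reused_instance = billed_hours(6 * 9)

print(fresh_instance_per_build, single_reused_instance)
```

Under these assumptions, reusing one running instance for the six builds is billed a single hour instead of six.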
Phase 2.1: stop them manually
A first optimization that can be performed manually is to stop and start these instances from the console. You would usually stop them at some hour of the day or evening and then start them again first thing in the morning.
Of course, there is good potential for automating this, as AWS provides many ways to access EC2 through its API and the SDKs built on top of it, which are available for many different programming languages. You can easily build some commands that query the EC2 API looking for an instance based on its tags, and then issue commands for starting and stopping it. In general, this is almost transparent for the CloudFormation templates that you are surely using for launching these instances.
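As a rough sketch of what such commands could look like, the following uses the boto3 EC2 client (the Python SDK is my choice here, not necessarily what we run). The helper names and the idea of matching on a Name tag are assumptions for illustration; the client is passed in as a parameter so the helpers can be exercised without AWS credentials.

```python
def find_instance_ids(ec2, tag_key, tag_value):
    """Return the IDs of the instances carrying the given tag.

    `ec2` is expected to be a boto3 EC2 client. Only the first page of
    results is read, which is enough for a handful of instances.
    """
    response = ec2.describe_instances(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def stop_by_tag(ec2, tag_key, tag_value):
    """Stop every instance matching the tag; return the IDs that were targeted."""
    instance_ids = find_instance_ids(ec2, tag_key, tag_value)
    if instance_ids:  # nothing to do when no instance matches
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


def start_by_tag(ec2, tag_key, tag_value):
    """Start every instance matching the tag; return the IDs that were targeted."""
    instance_ids = find_instance_ids(ec2, tag_key, tag_value)
    if instance_ids:
        ec2.start_instances(InstanceIds=instance_ids)
    return instance_ids


# Usage (requires boto3 and AWS credentials; the tag value is hypothetical):
#   import boto3
#   ec2 = boto3.client("ec2")
#   stop_by_tag(ec2, "Name", "x--ci")
```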
The first time you start and stop an instance, there are a few problems that may come up.
The first problem is that of ephemeral storage: as I wrote before, you have to make sure the root volume of the instance and any data you want to persist are EBS-backed and not local instance storage.
The second problem is that of public IP addresses. While private IP addresses inside a VPC stay the same after a stop and a start, public IP addresses are a scarce resource and are only allocated to an instance while it is running. Therefore, if you had a DNS entry pointing to it, that entry has to be updated after the boot, whether it was manually created or part of the CloudFormation template. Default DNS entries have the form ec2-public-ip-address.compute-1.amazonaws.com, which depends on the public IP address and hence does not provide a good indirection.
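One way to refresh the DNS entry after a boot could look like the sketch below. It assumes the zone is hosted on Route 53, which is an assumption of this example; the helper names, the hosted zone ID and the hostname are hypothetical. Both clients are boto3 clients passed in by the caller.

```python
def public_ip_of(ec2, instance_id):
    """Return the current public IP of an instance, or None if it has none
    (for example because it is stopped)."""
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    return instance.get("PublicIpAddress")


def upsert_a_record(route53, hosted_zone_id, hostname, ip_address, ttl=60):
    """Create or overwrite an A record so that `hostname` points at `ip_address`."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": hostname,
                        "Type": "A",
                        "TTL": ttl,  # kept short since the address changes at each start
                        "ResourceRecords": [{"Value": ip_address}],
                    },
                }
            ]
        },
    )


# Usage (requires boto3 and AWS credentials; all identifiers are hypothetical):
#   import boto3
#   ip = public_ip_of(boto3.client("ec2"), "i-0123456789abcdef0")
#   if ip is not None:
#       upsert_a_record(boto3.client("route53"), "ZEXAMPLE", "x--ci.example.org.", ip)
```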
The third problem is that of long-running processes managed by SysV/Upstart/Systemd: the daemons of servers like Apache, Nginx or MySQL are usually configured to restart upon boot, but if you have written your own daemons or long-running Python/PHP processes and are starting them through /etc/init or /etc/init.d configuration, it pays to check that everything is in its place again after boot.
The last problem I have found at this level (manual restarts) is about files in /run and /var/run, which are temporary directories used by daemons to place locks and other transient files, like a pidfile indicating an instance of that program is running. If you have folders in /run or /var/run, those folders will have to be recreated. Systemd provides the tmpfiles.d option, which automatically creates a hierarchy of files, but it's usually just easier (and more portable) to have the daemons create their own folders (php-fpm does that) or, if they are not able to do that, to place their files not in /var/run/some_folder_that_will_stop_existing but directly in /var/run or even /tmp, without subfolders.
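For a daemon you wrote yourself, creating the folder at startup takes only a few lines. This is a minimal sketch for a hypothetical Python daemon; the function names and the paths in the usage comment are invented for the example.

```python
import os


def ensure_runtime_dir(path, mode=0o755):
    """Create the runtime directory if it is missing, as it will be after
    every boot when it lives under /run or /var/run."""
    os.makedirs(path, mode=mode, exist_ok=True)
    return path


def write_pidfile(run_dir, name):
    """Write the current process ID to <run_dir>/<name>.pid, creating the
    directory first, and return the path of the pidfile."""
    ensure_runtime_dir(run_dir)
    pidfile = os.path.join(run_dir, f"{name}.pid")
    with open(pidfile, "w") as handle:
        handle.write(f"{os.getpid()}\n")
    return pidfile


# Usage at daemon startup (hypothetical path, needs write access to /var/run):
#   write_pidfile("/var/run/mydaemon", "mydaemon")
```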
Phase 2.2: start them on demand
Instead of manually starting EC2 instances, or automating their stopping and starting as a periodic task, you can also start them on demand as needed by the various builds that have to run. So whenever project x needs to build a new commit on master or a pull request, you will start the x--ci EC2 instance.
In this case, however, there is a larger potential for race conditions as you may try to run a deploy or any command on an instance before it's actually ready to be used. Therefore, we wrote some automation code that waits for several events before letting a build proceed:
- the instance must have gone from the pending to the running state on the EC2 API. This hopefully means AWS has found a CPU and other resources to assign to it.
- the instance must be accessible through SSH.
- through SSH, we monitor that the file /var/lib/cloud/instance/boot-finished has appeared. This file will appear at each boot when all daemons have been started, as part of the standard cloud-init package.
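The three checks above can be sketched as follows. This is not our actual automation code, only an illustration under some assumptions: boto3 as the SDK (its instance_running waiter covers the first check), a plain ssh binary on the build machine, and a hypothetical ubuntu login user. The helper names, timeouts and intervals are made up.

```python
import socket
import subprocess
import time


def wait_until(check, timeout=300.0, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True; raise TimeoutError after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        if check():
            return
        if clock() >= deadline:
            raise TimeoutError(f"gave up after {timeout} seconds")
        sleep(interval)


def ssh_port_open(host, port=22, connect_timeout=3.0):
    """True if a TCP connection to the SSH port can be established."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def boot_finished(host, user="ubuntu"):
    """True once cloud-init has written its boot-finished marker on the host."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", f"{user}@{host}",
         "test", "-f", "/var/lib/cloud/instance/boot-finished"],
        capture_output=True,
    )
    return result.returncode == 0


def wait_for_instance(ec2, instance_id, host):
    """Block until the instance is safe to use for a build."""
    # 1. pending -> running, according to the EC2 API
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    # 2. the SSH port accepts connections
    wait_until(lambda: ssh_port_open(host))
    # 3. cloud-init reports that the boot has completed
    wait_until(lambda: boot_finished(host))
```

The order matters: each check is cheaper and coarser than the next, so a build only pays for the SSH round trips once the instance is known to be running and reachable.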
Phase 2.3: stop them when it's more efficient
Once you have transitioned to starting instances at the last responsible moment, you can do the same for stopping them, instead of just waiting for the end of the day to shut everything down.
We now have a periodical job, running every 2 minutes, that takes a list of servers to stop. In parallel, it performs the following routine for each of the EC2 instances:
- checks if the server has been running for an amount of time between h:55:00 and h:59:59, where h is some number of hours.
- if the condition is true, stops the instance before we incur a new billed hour.
- otherwise, leaves the instance running: you already paid for this hour so it does no harm to do so, as the instance can be used to run new builds at no extra cost.
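The decision in the first bullet boils down to a modulo on the uptime. Here is a minimal sketch; the function and constant names are made up, and it assumes you can obtain the time of the most recent start of the instance (for example from the launch time reported by the EC2 API).

```python
from datetime import datetime, timedelta, timezone

BILLING_PERIOD = timedelta(hours=1)
STOP_WINDOW_START = timedelta(minutes=55)


def in_stop_window(launch_time, now):
    """True when the uptime is between h:55:00 and h:59:59 for some whole
    number of hours h, that is, in the last 5 minutes of a paid hour."""
    uptime = now - launch_time
    if uptime < timedelta(0):
        raise ValueError("launch_time is in the future")
    # Position inside the current billed hour, from 0:00 up to 59:59.
    return uptime % BILLING_PERIOD >= STOP_WINDOW_START


launched = datetime(2016, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
print(in_stop_window(launched, launched + timedelta(minutes=20)))           # False: keep it running
print(in_stop_window(launched, launched + timedelta(hours=2, minutes=57)))  # True: stop it now
```

With the job running every 2 minutes and a window that is 5 minutes wide, at least two runs fall inside the window of every billed hour, which leaves some slack if one run is delayed.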
Bonus: Jenkins locks
Starting and stopping instances periodically would be dangerous if there wasn't a mechanism for mutual exclusion between builds and lifecycle operations like starting and stopping. Not only do you not want to run builds for the same project on the same instance if they interfere with each other, but you definitely don't want an instance to be shut down while a build is still running.
Therefore, we wrap both these lifecycle operations and the builds in per-resource locks, using the Jenkins Lockable Resources plugin. If the periodic stopping task tries to stop an instance where a build is running, it will have to wait to acquire the lock. This ensures that machines that see many builds do not get easily stopped, while idle ones will be stopped at the end of their already paid hour.
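The real mechanism here is the Jenkins plugin, not code of our own. Purely as an analogy, the same mutual exclusion can be illustrated in Python with one in-process lock per instance name; all the names below are hypothetical and nothing in this sketch talks to Jenkins.

```python
import threading
from collections import defaultdict

# One lock per instance name, created on first use. The guard protects the
# dictionary itself, so two threads cannot create two locks for one name.
_locks = defaultdict(threading.Lock)
_registry_guard = threading.Lock()


def lock_for(instance_name):
    """Return the lock associated with an instance name."""
    with _registry_guard:
        return _locks[instance_name]


def run_build(instance_name, build):
    """Run a build while holding the lock of the instance it runs on."""
    with lock_for(instance_name):
        return build()


def stop_instance(instance_name, stop):
    """Stop an instance, waiting first for any build holding its lock."""
    with lock_for(instance_name):  # blocks while a build is running
        stop()
```

Because builds and the stop operation take the same lock, a stop requested in the middle of a build only happens after that build has finished.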