Tuesday, March 28, 2017

Book review: Site Reliability Engineering


The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? -- the book tagline
Site Reliability Engineering - How Google runs production systems is a 2016 book about the ops side of Google services, and the set of principles and practices that underlie it.

A Site Reliability Engineer is a software engineer (in the developer sense) that designs and implement systems that automated what would otherwise be done manually by system administrators. As such, SREs have a directive for employing a minimum 50% of their time in development rather than firefighting and maintenance of existing servers (named toil in the book).

The book really is a collection of chapters, so you don't have to be scared by its size as you don't necessarily need to read it cover to cover. You can instead zoom in on the interesting chapters, being them monitoring, alerting, outage tracking or even management practices.

Principles, not just tools

I like books that reify and give names to concepts and principles, so that we can talk about those concepts and refer to them. This book gives precise definitions for the Google version of measures such as availability, Service Level Objectives and error budgets.
This is abstraction at work: even if the examples being used show Google specific tools like Borgmon or Outalator, the solutions are described at an higher abstraction level that makes them reusable. When load balancing practices are made generic enough to satisfy all of Google services, you can bet that they are reusable enough to be somewhat applicable to your situation.

Caveat emptor

Chances are that you're not Google: the size of your software development effort is order of magnitudes smaller, and even when it is of a comparable size it's not easy to turn an established organization into Google (and probably not desirable nor necessary.)
However, you can understand that Google and the other giants like Facebook and Amazon are shaping our industry through their economically irresistible software and services. Angular vs React is actually Google vs Facebook; containers vs serverless is actually Kubernetes vs Lambda which is actually Google vs Amazon. The NoSQL revolution was probably started by the BigTable and Dynamo papers... and so on: when you deployed your first Docker container, Google engineers had already been using similar technology for 10 years; as such, they can teach you a lot at a relatively little cost through the pages of a book. And it's better to be informed on what may come to a cloud provider near you in the next years.


It took some time to get through this book, but it gives a realistic picture of running systems that undergo a large scale of traffic and changes at the same time. Besides the lessons you can directly get from it, I would recommend it to many system administrators and "devops guys" as a way to think more clearly about which forces and solutions are at play in their datacenters (or more likely, virtual private clouds).

No comments: