Scaling SRE through "Reliability as Code"

By Brandon Morgado

Elevator Pitch

Our SRE org has leaned into building tools that all consume YAML files. This applies “infrastructure as code” patterns to micro-service observability and resiliency. This presentation will walk through the evolution of our “Reliability as Code” tools and share how our SRE org has scaled.

Description

We would like to share how we solved the problems of implementing reliability in a micro-service environment at scale while building a Site Reliability organization.

Over the last few years, we have leaned into building tools that consume YAML files. This allows us to quickly provision new instances of our tools to assist with our micro-service observability and resiliency. This presentation will walk through the evolution of our “Reliability as Code” tool suite and show how the speed of our development process has allowed us to scale.

I will discuss some patterns that we have implemented in our reliability tooling and share the pitfalls we ran into along the way.

Reliability as Code Overview:

A. Healthchecks as Code.
B. Dashboards as Code.
C. Alerts as Code.
D. Resiliency as Code.
E. … and Beyond.

Notes

As the current Director of SRE at a local company, I’ve been building a Site Reliability organization for the last 5 years. Most of the challenges we’ve faced over those years were related to “scale”. This talk will share how we leveraged the software development expertise in our teams to scale our reliability work “sub-linearly”.

We first utilized “healthchecks as code” to create lambda functions that systematically check the availability of our micro-service endpoints every single minute.
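
To make the idea concrete, here is a minimal sketch of what a healthcheck definition and its runner could look like. It is illustrative only and not our production code: the `HealthcheckSpec` fields, the `load_specs` and `run_checks` helpers, and the shape of the config mapping (standing in for a parsed YAML file) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable
import urllib.error
import urllib.request


@dataclass(frozen=True)
class HealthcheckSpec:
    """One healthcheck entry (hypothetical schema for this sketch)."""
    service: str
    url: str
    expected_status: int = 200
    timeout_seconds: float = 5.0


def load_specs(config: dict) -> list[HealthcheckSpec]:
    """Turn an already-parsed config mapping into typed specs."""
    return [HealthcheckSpec(**entry) for entry in config["healthchecks"]]


def http_status(url: str, timeout: float) -> int:
    """Return the HTTP status of a GET request, or 0 if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return 0


def run_checks(
    specs: list[HealthcheckSpec],
    fetch: Callable[[str, float], int] = http_status,
) -> dict[str, bool]:
    """Map each service name to True when its endpoint returned the expected status."""
    return {
        spec.service: fetch(spec.url, spec.timeout_seconds) == spec.expected_status
        for spec in specs
    }
```

In a setup like the one described above, a scheduled function would call `run_checks` once a minute and report the failures. Passing `fetch` as a parameter keeps the runner testable without touching the network.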

We then built “dashboards as code” to automate the creation of Grafana dashboards for each of our micro-services. As soon as a new service is provisioned, there is a fresh, beautiful Grafana dashboard already waiting for the developers.
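
As a rough illustration of the pattern, the sketch below turns a small service definition into a Grafana dashboard JSON document. The service schema, the `build_dashboard` helper, and the `$service` placeholder convention are assumptions for this example; the output uses a small subset of Grafana's dashboard JSON fields (`uid`, `title`, `tags`, `panels`, `gridPos`, `targets`), and the `expr` field assumes a Prometheus-style data source.

```python
import json
from string import Template


def build_dashboard(service: dict) -> dict:
    """Build a dashboard document from a parsed service definition (hypothetical schema)."""
    panels = []
    for index, metric in enumerate(service["metrics"]):
        # Substitute the service name into the query before handing it to Grafana.
        query = Template(metric["query"]).substitute(service=service["name"])
        panels.append({
            "id": index + 1,
            "type": "timeseries",
            "title": metric["title"],
            # Lay panels out two per row on Grafana's 24-column grid.
            "gridPos": {"h": 8, "w": 12, "x": (index % 2) * 12, "y": (index // 2) * 8},
            "targets": [{"refId": "A", "expr": query}],
        })
    return {
        "uid": f"svc-{service['name']}",
        "title": f"{service['name']} overview",
        "tags": ["generated", service["team"]],
        "panels": panels,
    }


if __name__ == "__main__":
    example = {
        "name": "orders",
        "team": "checkout",
        "metrics": [
            {"title": "Request rate", "query": 'sum(rate(http_requests_total{service="$service"}[5m]))'},
            {"title": "Error rate", "query": 'sum(rate(http_errors_total{service="$service"}[5m]))'},
        ],
    }
    print(json.dumps(build_dashboard(example), indent=2))
```

The resulting document could then be pushed to Grafana as part of service provisioning, which is what makes the dashboard available as soon as the service exists.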

Next came “alerts as code” to quickly provision alerts that connect our infrastructure monitoring to our paging tool.

Lastly, we built “loadtests as code” to streamline the performance testing of our services. On every new deploy, our load test tool generates synthetic traffic against the micro-service endpoint to determine the performance characteristics of the latest version of our code.
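
A simplified sketch of such a load test follows. It is not our actual tool: the spec keys (`url`, `requests`, `concurrency`, `max_p95_seconds`), the `run_load_test` helper, and the choice of a nearest-rank 95th percentile as the pass criterion are assumptions made for illustration.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_load_test(spec: dict, send_request: Callable[[str], object]) -> dict:
    """Send synthetic requests concurrently and compare p95 latency to the spec's budget."""

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        send_request(spec["url"])
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=spec["concurrency"]) as pool:
        # An exception raised by send_request propagates here and aborts the run.
        latencies = list(pool.map(timed_call, range(spec["requests"])))

    p95 = percentile(latencies, 95)
    return {
        "requests": len(latencies),
        "p95_seconds": p95,
        "passed": p95 <= spec["max_p95_seconds"],
    }
```

A deploy pipeline could call `run_load_test` with a real HTTP client as `send_request` and compare the result against the previous version's numbers.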