Since Google first began sharing details of its approach to massive-scale web operations - Site Reliability Engineering, or SRE - many organizations have adopted aspects of the SRE approach to help deal with the emerging complexity of modern software systems. However, there are really several different flavours of SRE, so how can we decide which is right for our organization at any time? How does SRE fit with other approaches to IT service management?
Why SRE and why now?
Modern cloud software systems are significantly more complicated (and sometimes complex in a Cynefin sense) than the software systems of previous decades. I have been helping organizations to adopt good engineering practices for many years, and what I see time and again is that many organizations are stuck thinking that creating and running software is essentially straightforward, whereas in fact software delivery is now a highly involved sociotechnical discipline.
SRE is Google’s response to this emerging complexity in modern software systems. In a nutshell, SRE is an enforced operability discipline, requiring product managers to spend the right amount of time and effort (but no more) on operability and reliability based on how users perceive the availability of the service.
What you’ll learn from the SRE whitepaper
First, we explore the core concepts of SRE, such as Error Budgets, Service Level Objectives (SLOs), Service Level Indicators (SLIs), and realistic reliability targets (no service can ever be 100% available). Even simply exploring SLOs and SLIs and defining reliability targets for existing software can be hugely beneficial. Many organizations have a poor picture of the actual availability of their software services as perceived by users, so taking a metrics-driven, user-centric approach to reliability and availability can really help to improve these services without any immediate changes in teams or skills.
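To make the relationship between an SLI, an SLO, and an Error Budget concrete, here is a minimal sketch in Python. It assumes a simple request-based availability SLI (good requests divided by total requests) and a rolling 30-day window. The names `SloReport`, `slo_report`, and `allowed_downtime_minutes` are illustrative only and do not come from the white paper or from any Google tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float              # measured fraction of good requests
    error_budget: float     # allowed fraction of bad requests (1 - SLO target)
    budget_consumed: float  # share of the error budget already used (1.0 = all of it)


def slo_report(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a measured availability SLI against an SLO target.

    Assumption: availability is measured per request, as good / total.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        # A 100% target leaves no error budget, so it is rejected here.
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_requests / total_requests
    error_budget = 1.0 - slo_target
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed)


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Translate an availability target into minutes of full downtime per window."""
    return (1.0 - slo_target) * window_days * 24 * 60
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of total downtime, and a service that served 999,500 good requests out of 1,000,000 against that target has used about half of its Error Budget. Other SLI definitions (latency thresholds, time-based availability) would need a different calculation.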
Second, we look at the different flavours of SRE, all of which are in use at Google and other high-performing organizations. Sometimes a separate SRE team can be the right approach, but often the benefits of SRE come from the focus on user-perceived availability, not from a separate team. In some cases, the SRE team acts as an enabling team that helps software teams understand operability and reliability engineering while the software team stays on-call for incidents and outages. In others, the SRE team provides production-grade services that software teams can use easily. There is no “one-size-fits-all” approach with SRE, apart from a relentless focus on user-perceived service availability.
Crucially, a separate SRE team is optional by design. Even at Google, only high-volume, well-behaved software services get the benefit of support from an SRE team. The default position is for software teams to build and run the software they own by themselves.
We also look at some pitfalls associated with naive adoption of SRE, especially simply renaming an existing System Administration or IT Operations team as “SRE” without putting in place the engineering and product discipline around the Error Budget and SLO/SLI measures.
Finally, we cover the relationship between SRE and ITIL®: both approaches aim to achieve continuous improvement of software services, but SRE does this by a clever balance between features and reliability with the trade-off decisions resting on the product owner.
To read more about SRE, download the AXELOS white paper Site reliability engineering.
Matthew Skelton is co-author of Team Topologies: organizing business and technology teams for fast flow (IT Revolution, 2019) and Team Guide to Software Operability (Skelton Thatcher, 2016). Head of Consulting at Conflux (confluxdigital.net), he specialises in Continuous Delivery, operability and organisation dynamics for software in manufacturing, ecommerce, and online services, including cloud, IoT, and embedded software.