Site reliability engineering White Paper

  • White Paper
  • Customer engagement
  • Service management
  • Value
  • ITIL

Author  Matthew Skelton

March 27, 2020

The complexity and increasing pace of change of modern large-scale cloud software systems have forced organizations to adopt new practices, disciplines, technologies, tools, and organizational dynamics.

Older operating models, although informative, are insufficient. These issues have led to the recent release of ITIL® 4.

Modern approaches to IT service management emphasize the importance of rapid value co-creation and fast flow1. One such approach is site reliability engineering (SRE). This white paper explains how and why organizations can benefit from SRE.

1.1 KEY POINTS

  • SRE is a proven approach to building and running massive-scale cloud software systems.
  • There are several different SRE practices, many of which are independently beneficial. Adopting all SRE practices together is often unnecessary.
  • SRE is optional by design: it is not the only method of running software in production.
  • Organizations should not simply rebrand traditional IT Operations as SRE without adopting the accompanying practices; doing so invites the pitfalls described in this paper.
  • SRE can be seen as ITIL continual improvement that is strongly focused on reliability and scalability for large-scale cloud systems.

Overview of SRE

SRE is an approach to building and running large-scale cloud-based software systems. It originated at Google in 2003 and has since been evolving there and at other large internet companies. SRE is an approach founded on principles, practices, and organizational dynamics that aim to ensure the reliability of large-scale cloud software systems that are continually developing.

Google documented their SRE approach at length in Site Reliability Engineering2. For Google, SRE is effectively IT Operations as designed and run by software engineers.

SRE is what happens when you ask a software engineer to design an operations team. (Benjamin Treynor Sloss, Google)3

From this perspective, SRE demonstrably precedes DevOps: it emerged five years before collaborative DevOps practices became common. It arguably carries the heritage of clearly separated ‘development’ and ‘operations’ disciplines. Compared to DevOps environments of multi-disciplined teams, SRE is notable for retaining separate reporting structures and career progression lines for building and running software.

2.1 CORE SRE DYNAMICS

SRE aims to ensure that the reliability and other operational characteristics of modern, large-scale software are addressed rapidly and effectively. In Google’s model, many product development teams (‘Dev’ or ‘SWE’ teams) are responsible for defining and building the bulk of the features in the service or application. A smaller number of SRE teams may be asked to help with reliability.

People on SRE teams are either software developers with strong operations knowledge or IT operations professionals with strong software development skills. SRE teams use software to solve problems; they will write software to automate any manual service-restoration task that they have performed more than twice. Because SRE teams understand and practice modern software development techniques, their software is well-written, with test scaffolding running in a continuous integration environment.

When software engineers design an IT operations function, everything is code; servers, infrastructure, updates, rollbacks, and scaling are all defined and executed as code rather than as interactive human operations. This ‘everything as code’ approach has several important implications, including that all changes:

  • start in version control (such as a Git repository)
  • are tracked with software tooling
  • are testable using test-first development techniques and test-driven frameworks, such as Cucumber and RSpec
  • are designed with instrumentation and observability in mind so that problems can be detected quickly.
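As an illustration of the 'everything as code' idea, the following minimal Python sketch expresses a scaling rule as data plus a pure function, with a test that could be written first and run in continuous integration. The names `ScalingPolicy` and `desired_replicas`, and the numbers used, are hypothetical and chosen only for this example; they do not come from any particular tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical scaling rule kept in version control alongside the service."""
    min_replicas: int
    max_replicas: int
    target_rps_per_replica: float


def desired_replicas(policy: ScalingPolicy, observed_rps: float) -> int:
    """Return the replica count for the observed load, clamped to the policy limits."""
    needed = math.ceil(observed_rps / policy.target_rps_per_replica)
    return max(policy.min_replicas, min(policy.max_replicas, needed))


def test_desired_replicas() -> None:
    """A test that can be written before the rule itself (test-first)."""
    policy = ScalingPolicy(min_replicas=2, max_replicas=10, target_rps_per_replica=100.0)
    assert desired_replicas(policy, 0) == 2       # never below the floor
    assert desired_replicas(policy, 450) == 5     # 4.5 replicas rounds up to 5
    assert desired_replicas(policy, 5000) == 10   # never above the ceiling
```

Because the rule is ordinary code, a change to it can be reviewed, tested, and rolled back like any other change.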

Software written by SRE typically performs better in production than software not written with these practices in mind. SRE ensures that common operational problems are managed early and often.

In summary, SRE is a high-skill operating model for online high-traffic software services. People in SRE teams have excellent coding skills and a strong drive to automate repetitive operations tasks using code, thereby continually reducing toil.

[Image: definition of toil]


The SRE model balances several metrics and team dynamics. An SRE workflow might be:

  1. Multi-skilled product development teams run their own services, including being on-call for incidents.4
  2. If and when the service reaches high traffic, the development team may ask an SRE team to run the service in the production environment. The development team utilizes the SRE team’s reliability and high-scale engineering skills.
  3. The product owner for the service must define a service level objective (SLO) based on the downtime that they deem acceptable.
  4. The available downtime becomes the ‘error budget’ for the service, which the development team can ‘spend’; for example, by testing new features. If the total downtime exceeds the error budget, no new changes are permitted.
  5. To be permitted to deploy more changes, the development team must demonstrate increased reliability through automated operational tests.
  6. Site reliability engineers must spend no more than 50% of their time dealing with toil5 (the other 50% should be spent coding); if a service supported by SRE starts to need more than 50% of site reliability engineers’ time to deal with incidents, the SRE team should assign operational improvements back to the Dev team.

This allows teams to address operational problems rapidly, and it keeps product owners honest about both the required SLO and the level of operability in their software service. Services that must be highly available need huge investments in automation and testing to enable a continual flow of user-visible changes.

This ‘software-first’ approach to IT Operations can also apply to the Dev team.

[…] monitoring the amount of operational work being done by [site reliability engineers], and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on. The redirection ends when the operational load drops back to 50% or lower. (Stephen Thorne, Google)6

So, if a Dev team produces software that the SRE team cannot operate within the 50% balance, the Dev team must adopt operational tasks, learning about operational aspects as needed7. It is crucial that the Dev team balances utilizing the SRE team’s skills with retaining responsibility for the operability of the software.
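The 50% balance can be sketched as a simple check. This is only an illustration under assumed inputs: the function names and the idea of measuring toil in hours per period are hypothetical, and a real team would decide for itself how to measure operational work.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of an SRE team's time spent on operational toil in a period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= toil_hours <= total_hours:
        raise ValueError("toil_hours must be between 0 and total_hours")
    return toil_hours / total_hours


def should_redirect_to_dev(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when toil exceeds the cap, so excess operational work goes back to the Dev team."""
    return toil_fraction(toil_hours, total_hours) > cap
```

For example, `should_redirect_to_dev(21, 40)` is true, while `should_redirect_to_dev(20, 40)` is false because the load is at 50%, not above it.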

2.2 GOOGLE-STYLE SRE

The SRE approach explained in Site Reliability Engineering is clearly extremely effective, as is evidenced by the near-legendary resilience and reliability of Google’s services and systems. Google operates on such a scale (millions of requests per second across thousands of services) that system behaviour is often difficult or impossible to anticipate and must be discovered in the live (production) environment. SRE teams use their expertise to enhance the reliability of already-excellent software, adjusting for edge cases and unexpected performance or failure conditions. This allows them to ensure that the right amount of effort is applied to managing software operability in order to meet availability targets.

So, Google-style SRE is effective in organizations with constant, high-traffic cloud applications and services where it is difficult to predict which interactions and error conditions will appear in the live environment, coupled with very large technical teams. Historically, few organizations had systems on this scale.

However, as cloud-based software becomes more pervasive, an increasing number of organizations will find themselves in a similar high-volume, low-predictability situation. Google-style SRE could benefit these organizations, if they can sustain the high degree of engineering discipline needed to balance feature development and reliability.

Benefiting from SRE: the pitfalls

Google-style SRE contains several elements that can be separated and used independently or in combination to address different challenges when building and running software at scale.

3.1 SERVICE LEVEL OBJECTIVES VIA SERVICE LEVEL INDICATORS

The SLO for the application or service being run by the SRE team is central to the SRE approach. An SLO is a performance or availability target: a degree of performance or availability that meets business expectations at an acceptable cost. For instance, the SLO for a simple postal code lookup service could be to serve 95% of requests within 100 milliseconds. An SLO for a more complicated or important service may include additional bands of performance: serve 95% of requests within 100 milliseconds and serve 99% of requests within 200 milliseconds. The SLO should relate to the purpose of the specific service or application.
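The two-band example above can be checked against a set of observed request latencies. The sketch below is a minimal illustration; the names `SLO_BANDS`, `fraction_within`, and `meets_slo` are hypothetical, and it assumes latencies have already been collected in milliseconds.

```python
from typing import Sequence, Tuple

# (latency threshold in ms, minimum fraction of requests that must meet it)
SLO_BANDS: Tuple[Tuple[float, float], ...] = ((100.0, 0.95), (200.0, 0.99))


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests served within the given latency threshold."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def meets_slo(latencies_ms: Sequence[float],
              bands: Sequence[Tuple[float, float]] = SLO_BANDS) -> bool:
    """True only if every band of the SLO is satisfied."""
    return all(fraction_within(latencies_ms, threshold) >= target
               for threshold, target in bands)
```

With 97 requests at 50 ms, 2 at 150 ms, and 1 at 500 ms, both bands are met (97% within 100 ms, 99% within 200 ms), so `meets_slo` returns true.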

The service’s availability or performance is measured using neutral, agreed monitoring tools, so determining whether the SLO has been breached is straightforward. Because of the laws of physics and computer networking, SLOs are always less than 100%. However, they can vary from 99% availability or less (around 7 hours of downtime per month) to 99.999% availability or higher (26 seconds of downtime per month). The product manager for the service must choose an appropriate SLO that provides enough allowable downtime to cover unforeseen problems while delivering features and updates at an acceptable rate.
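The downtime figures quoted above follow from simple arithmetic. The sketch below assumes an average month of 365/12 days (730 hours); that period length and the function name are assumptions made for this example, and a different month length shifts the results slightly.

```python
HOURS_PER_MONTH = 365 * 24 / 12  # assumed average month: 730 hours


def allowed_downtime_seconds(availability_percent: float,
                             period_hours: float = HOURS_PER_MONTH) -> float:
    """Downtime permitted in the period for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_hours * 3600


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.999):
        seconds = allowed_downtime_seconds(target)
        print(f"{target}% -> {seconds:.1f} s ({seconds / 3600:.2f} h) per month")
```

Under this assumption, 99% availability allows roughly 7.3 hours of downtime per month, and 99.999% allows roughly 26 seconds.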

Because service downtime is measured by neutral pervasive tooling, the results are indisputable. The SLO approach also drives the adoption of synthetic transaction monitoring8, an excellent practice for customer-facing systems. Synthetic transaction monitoring tests end-to-end customer journeys regularly (such as every 60 seconds) from an automated script. This brings the service closer to the customer and, by extension, the Dev and SRE teams closer to the customer as well.
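A synthetic transaction monitor can be sketched as a scripted journey that runs on a timer. In this minimal illustration each step is an arbitrary callable (for example, a function that requests a page and raises on failure); the names `run_journey` and `monitor` and the shape of the result dictionary are hypothetical rather than taken from any monitoring product.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action that raises on failure)


def run_journey(steps: Sequence[Step]) -> dict:
    """Run the steps of one customer journey in order and stop at the first failure."""
    started = time.monotonic()
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return {"ok": False, "failed_step": name, "error": str(exc),
                    "duration_s": time.monotonic() - started}
    return {"ok": True, "failed_step": None, "error": None,
            "duration_s": time.monotonic() - started}


def monitor(steps: Sequence[Step],
            interval_s: float = 60.0,
            iterations: Optional[int] = None,
            sleep: Callable[[float], None] = time.sleep,
            report: Callable[[dict], None] = print) -> None:
    """Run the journey repeatedly; iterations=None means run until interrupted."""
    count = 0
    while iterations is None or count < iterations:
        report(run_journey(steps))
        count += 1
        if iterations is None or count < iterations:
            sleep(interval_s)
```

Passing `sleep` and `report` as parameters keeps the loop testable: a test can run two iterations with no real waiting, while production use would send each result to the monitoring system.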

Conformance to SLOs is measured through service level indicators (SLIs). An SLI is a single quantitative measure of some aspect of the behaviour or performance of a service or system. SLIs generally relate to characteristics that users care about, such as response time for web applications or durability for data persistence. It is important to select a small number of SLIs that are relevant to users.

Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviours of your system unexamined.9

Crucially, the SRE approach focuses on the ways in which users experience services and applications. Instead of monitoring uptime (when a process is running or a webpage is present), SLOs driven by SLIs prioritize the quality of the interaction experienced by the user, which is ultimately one of the most important criteria for successful software.

[Image: key message on service level objectives and service level indicators]

3.2 BALANCING SPEED AND RELIABILITY WITH THE ERROR BUDGET

Fundamentally, the SRE function is empowered to reject low-quality software. Specifically, if Dev teams ask an SRE team to run their software, the SRE team can demand evidence of operability in the form of automated test results and instrumentation. If the code is not good enough from an operations perspective, the SRE team can and should reject the code as unfit for production.10

Google manages this focus on proven operability with an error budget. Each service or application run by an SRE team has an associated availability target that comes with allowed downtime. For example, for 99.8% availability, the allowed downtime is just over 87 minutes per month. This allowed downtime is the error budget.

The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. (Mark Roth, Google)11

The SRE team tracks a service’s downtime and, if the service is within its error budget, the Dev team can deploy new changes. Changes may cause an outage. If changes cause the service to be unavailable for longer than the error budget allows, the Dev team cannot deploy new changes in that period and must demonstrate significantly improved reliability before deploying any more changes.
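The deployment rule described here can be sketched as a small error-budget tracker. This is an illustration only: the class name, its fields, and the assumed average month of 365/12 days (43,800 minutes) are choices made for the example, not a description of Google's tooling.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # assumed average month: 43,800 minutes


@dataclass
class ErrorBudget:
    """Tracks downtime against the allowance implied by an availability target."""
    availability_target: float            # for example, 0.998 for 99.8%
    period_minutes: float = MINUTES_PER_MONTH
    downtime_minutes: float = 0.0

    @property
    def budget_minutes(self) -> float:
        """Total downtime allowed in the period."""
        return (1.0 - self.availability_target) * self.period_minutes

    @property
    def remaining_minutes(self) -> float:
        """Budget left; negative once the budget is overspent."""
        return self.budget_minutes - self.downtime_minutes

    def record_outage(self, minutes: float) -> None:
        if minutes < 0:
            raise ValueError("minutes must not be negative")
        self.downtime_minutes += minutes

    def can_deploy(self) -> bool:
        """New changes are allowed only while downtime stays within the budget."""
        return self.downtime_minutes <= self.budget_minutes
```

With a 99.8% target the budget under this assumption is about 87.6 minutes per month. After a 60-minute outage `can_deploy()` still returns true; a further 30-minute outage takes total downtime to 90 minutes and `can_deploy()` returns false.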

In practice, Dev teams and SRE teams collaborate to prepare the software for production, collaborating on instrumentation, performance, resilience, error codes, and so on. This means that when the Dev team hands over software to the SRE team at the production readiness review, it has already been proven to work well. Even so, the SRE team’s ability to insist on good operability is crucial for the success of the SRE approach.

With an error budget in place, SRE teams and Dev teams can straightforwardly discuss risks.

[The error budget] metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow. (Mark Roth, Google)12

This means that accurate measurements of service availability are essential for building trust between SRE and Dev teams. Investing in high-quality tools and training for site reliability engineers to be able to measure and report on service availability is crucial.

When the error budget is too restricting, there are two options:

  • redefine the SLO to be less available and therefore possibly increase the error budget
  • improve the software’s operational aspects so that it has better operability and fails less often.

[Image: key message on the error budget]

3.3 NEW SKILLS FOR EVERYTHING-AS-CODE IT OPERATIONS

A major difference between SRE and older approaches to IT operations activities is that SRE takes an ‘everything-as-code’ stance. If an action or remediation activity needs to be done more than twice, it will be automated using code. The focus on using code to define and solve operational problems has effectively defined a new set of skills in IT that are embodied in the site reliability engineer role.
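The 'more than twice' rule can be sketched as a simple log of manual tasks that flags candidates for automation. The class and method names below are hypothetical, and the task names in the usage note are invented for illustration.

```python
from collections import Counter
from typing import List


class ManualTaskLog:
    """Counts manual operational tasks and flags those done more than a threshold."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._counts: Counter = Counter()

    def record(self, task: str) -> None:
        """Note that a task was performed by hand once more."""
        self._counts[task] += 1

    def automation_candidates(self) -> List[str]:
        """Tasks performed more than `threshold` times, in alphabetical order."""
        return sorted(task for task, count in self._counts.items()
                      if count > self.threshold)
```

For example, after recording "restart cache node" three times and "rotate certificate" twice, `automation_candidates()` returns only `["restart cache node"]`.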

A site reliability engineer usually has very strong expertise in troubleshooting computer networking, server, and middleware performance problems, DNS misconfiguration, application behaviour glitches, and related problems. They also have deep knowledge of networking protocols, monitoring and logging, and Linux command-line tools. Site reliability engineers expect to eliminate toil from their day-to-day activities through automation and code.13


[Image: key message on everything-as-code]

STRONG FOCUS ON OPERABILITY LEADS TO RELIABLE SYSTEMS

The user-centric balance of features and reliability within the SRE approach promotes a strong focus on operability: the runtime operational effectiveness of software systems. Prioritizing operability as a first-class concern ensures that operational aspects of the software (performance, resilience, testability, observability, and so on) are addressed. With the increasing complexity, scale, and vulnerability of modern software systems, this attention to operability indicates truly forward-thinking organizations. Production-facing teams (whether SRE or not) can bring huge value to organizations by working with Dev teams to identify and address operational challenges early on, enabling them to improve the operability, release-ability14, simplicity15, and monitorability16 of the software.

[Image: key message on the SRE approach]

3.4 MULTIPLE ORGANIZATIONAL MODELS FOR SRE

Google-style SRE uses a separate reporting structure for SRE teams but other organizational models can be used to promote many of the SRE practices and principles. Some effective organizational models include:

  • Using an SRE team to enable or facilitate the adoption of good operability and reliability approaches. In this model, the SRE team is typically not on-call for live operations but provides subject matter expertise to the Dev team.
  • Using an SRE team to define and build a compelling platform to help Dev teams. The SRE team aims to simplify the challenges around monitoring, deployment, scaling, fault tolerance, and so on by providing developer-friendly tooling and automation that improves reliability through a set of proven defaults. In this model, the SRE team is not on-call for application-level incidents.
  • Using site reliability engineers in Dev teams to increase the reliability and operability of software as it is being developed. In this model, site reliability engineers may be on-call alongside colleagues from the Dev team.
  • Google-style SRE with a separate reporting structure, collaborating with Dev teams. In this model, the SRE team only becomes involved in a software system when the usage of the system has passed a certain scale.17

[Image: key message on multiple organizational models]



Potential challenges of SRE

The SRE approach clearly has many advantages, as demonstrated by the reliability and scalability of the software systems of organizations such as Google. With the right organizational dynamics (especially mutual trust and engineering discipline), innovation at scale and speed becomes achievable with SRE. However, given Google’s near-unique position as a pioneer, innovator, and leader in cloud software, it is important to consider how well the SRE approach can work in other organizations, and what pitfalls organizations should be aware of. This section explores possible problems that would need to be overcome for SRE to be successful in smaller organizations.

4.1 SRE AS ANOTHER IT OPS TEAM

A separate SRE team or function may be treated as ‘just another IT operations team’ whose responsibility is to run software systems in production, reintroducing all of the drawbacks of that approach, including limited awareness of operational concerns in Dev teams, instability, huge out-of-hours software deployments, and so on.

A traditional IT operations team is unlikely to have SRE skills without significant further training. Effective SRE (regardless of the organizational model that is adopted) requires people with an unusual range of skills, including:

  • deep knowledge and experience of operating systems, container fabrics, computer networking, alerting, and monitoring
  • drive to collaborate with Dev teams on improving the operability of the software applications in production
  • ability to focus on business-relevant SLOs.

These skills differ hugely from the traditional IT Operations skillset. Therefore, renaming the ‘Ops team’ to the ‘SRE team’ is unlikely to produce good results. A new organizational dynamic, including high trust between software teams and site reliability engineers and good discipline from product management around error budgets and reliability, is needed.

4.2 SEPARATE REWARD STRUCTURES LEAD TO CONFLICT

In organizations with strong engineering prowess and mutual respect, the separate organizational reporting lines of product development (software engineering) and SRE lead to a balance between the delivery of new features and software reliability. Within the limits of an SLO, the Dev team can deploy as many changes as possible without exceeding the error budget, maximizing the rate of innovation. The SRE team helps to improve the reliability of the software service through specialist diagnosis and implementation, maintaining or increasing future innovation.

Although these different focuses are important, having separate reward structures for features and reliability can cause conflicts between the two groups. Without trust and discipline, organizations with separate reward structures risk antagonism between the separate teams, possibly resulting in a rapid decay of speed and safety. These conflicts are partially responsible for the rise in the popularity of DevOps approaches since 2008, which sought to combine the responsibility for features and reliability.

4.3 INSUFFICIENT ORGANIZATIONAL DISCIPLINE

Google-style SRE needs a high degree of organizational discipline in order to work effectively. In many organizations, a team of production-facing engineers would be unable to prevent deployments of unreliable software. Contrastingly, the SRE approach at Google relies on SRE teams being able to reject poor quality code. In the absence of high organizational discipline, which promotes reliability, it is very likely that software quality and service reliability will rapidly decay, leading to unplanned outages, broken SLOs, and unhappy customers.

If an external supplier is providing SRE skills and awareness, operational discipline is even more important. In this context, an organization may remain responsible for the product management and software development and outsource the production-facing SRE activities. In these cases, the commercial contract with the supplier must codify the organization’s ability to reject poor quality software changes. This requires strong trust between the contracting parties.

4.4 OPERATIONAL CONCERNS ARE AN SRE PROBLEM

For organizations that choose to have separate teams or structures for SRE, the operational aspects of software can be seen as solely the SRE team’s remit. This risks a return to a pre-DevOps era of poor reliability and low operability. Organizations should aim for reliable, working software through the early discovery and implementation of operational features, so Dev teams should take responsibility for operability.

How SRE and ITIL relate and differ

SRE and ITIL may seem different, but they have some similarities. Both SRE and ITIL aim for the continual improvement of software services by enabling the inspection and adaptation of the software, and both emphasize a holistic service as the focus of attention. The focus on the overarching service (rather than on individual pieces of software) has always been a strength of ITIL.

Both ITIL and SRE have a strong concept of service readiness and, in both disciplines, if software is unfit for service it can be rejected by the team responsible for the production environment. Historically, many organizations rarely benefitted from this discipline because the ITIL principles were often restricted to the IT operations teams, which led to missed opportunities for improving operability before the go-live date.

SRE is not a synonym for IT Operations. SRE teams and individuals enhance the reliability and operability of massive-scale software services and detect and remediate unforeseeable problems resulting from massive scale, thereby reducing the cognitive load on product development teams. The presence of SRE teams also helps to enforce a healthy balance between new features and reliability. In this sense, SRE differs significantly from how many organizations historically interpreted ITIL (with separate IT operations functions deploying and running software in production). ITIL 4 emphasizes the need to optimize for a rapid, reliable flow of change, aligning more closely with SRE.

A major difference between SRE and how ITIL has historically been used in many organizations is the speed of the feedback loop for each software change. In the SRE world, changes to running production systems may happen many times per day to many thousands of runtime nodes; feedback from failed or substandard changes happens extremely rapidly, and fixes can be enacted within minutes through collaboration between the SRE team and the Dev team. ITIL 4 aims to change expectations from service owners about the speed and frequency of software changes, aligning ITIL with ‘cloud-native’ approaches such as SRE.

Summary

SRE is a specific approach to enhancing reliability for large-scale cloud software systems. The SRE model promotes a healthy and productive interaction between the Dev and SRE teams using SLOs and error budgets to balance the speed of deployment of new features with the work needed to make the software operate well. SRE therefore needs quite specialist and unusual skills to succeed, as well as high trust between teams.

Organizations wishing to safely innovate at speed can adopt many of the elements of SRE without necessarily creating a separate Google-style SRE function: SLOs driven by SLIs, a heightened focus on operability, increased skills and capabilities around cloud operations, and better inter-team working can be used whether or not a separate SRE function is used.

ITIL 4 emphasizes that speed and reliability can coexist, which aligns with modern, cloud-native operating models like SRE.

About the author

Matthew Skelton is co-author of Team Topologies: organizing business and technology teams for fast flow19 and Team Guide to Software Operability20.

Head of Consulting at Conflux (confluxdigital.net), he specializes in continuous delivery, operability, and organization dynamics for software in manufacturing, ecommerce, and online services, including cloud, IoT, and embedded software.

Further reading

ITIL and Fast Value Co-creation. Available at: www.axelos.com/case-studies-and-white-papers/itil-4-and-fast-value-co-creation [Accessed 3 Feb. 2020]

Site Reliability Engineering - Google. Available at: landing.google.com/sre/sre-book/chapters/introduction/ [Accessed 3 Feb. 2020]

Skelton, M, Thatcher, R, Moore, A. (2016). Team Guide to Software Operability. Skelton Thatcher Publications.

Petoff, J, Murphy, N, Jones, C, Beyer, B. (2016). Site Reliability Engineering. O’Reilly Media, Inc.

End Notes

1 ITIL and Fast Value Co-creation. Available at: www.axelos.com/case-studies-and-white-papers/itil-4-and-fast-value-co-creation [Accessed 3 Feb. 2020]

2 Petoff, J, Murphy, N, Jones, C, Beyer, B. (2016). Site Reliability Engineering. O’Reilly Media, Inc.

3 Site Reliability Engineering - Google. Introduction. Available at: landing.google.com/sre/sre-book/chapters/introduction/ [Accessed 3 Feb. 2020]

4 Why should your app get SRE support? - CRE life lessons. Available at: cloud.google.com/blog/products/gcp/why-should-your-app-get-sre-support-cre-life-lessons [Accessed 3 Feb. 2020]

5 Tenets of SRE. Available at: medium.com/@jerub/tenets-of-sre-8af6238ae8a8 [Accessed 3 Feb. 2020]

6 Tenets of SRE. Available at: medium.com/@jerub/tenets-of-sre-8af6238ae8a8 [Accessed 3 Feb. 2020]

7 5 proven operability techniques for software teams. Available at: techbeacon.com/app-dev-testing/5-proven-operability-techniques-software-teams [Accessed 3 Feb. 2020]

8 Power combination: Synthetic and real device app monitoring. Available at: techbeacon.com/app-dev-testing/power-combination-synthetic-real-device-app-monitoring [Accessed 3 Feb. 2020]

9 Site Reliability Engineering - Google. Service level objectives. Available at: landing.google.com/sre/sre-book/chapters/service-level-objectives/ [Accessed 3 Feb. 2020]

10 Site Reliability Engineering (SRE) as a Managed Service. Available at: n4stack.io/2018/09/25/sre-as-a-service/ [Accessed 3 Feb. 2020]

11 Site Reliability Engineering - Google. Embracing Risk. Available at: landing.google.com/sre/sre-book/chapters/embracing-risk/ [Accessed 3 Feb. 2020]

12 Site Reliability Engineering - Google. Embracing Risk. Available at: landing.google.com/sre/sre-book/chapters/embracing-risk/ [Accessed 3 Feb. 2020]

13 Site Reliability Engineering - Google. Eliminating Toil. Available at: landing.google.com/sre/sre-book/chapters/eliminating-toil/ [Accessed 3 Feb. 2020]

14 Site Reliability Engineering - Google. Release Engineering. Available at: landing.google.com/sre/sre-book/chapters/release-engineering/ [Accessed 3 Feb. 2020]

15 Site Reliability Engineering - Google. Simplicity. Available at: landing.google.com/sre/sre-book/chapters/simplicity/ [Accessed 3 Feb. 2020]

16 Site Reliability Engineering - Google. Monitoring Distributed Systems. Available at: landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ [Accessed 3 Feb. 2020]

17 How Reliability and Product Teams Collaborate at Booking.com. Available at: medium.com/booking-com-infrastructure/how-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb [Accessed 3 Feb. 2020]

18 5 ways site reliability engineering transforms IT Ops. Available at: techbeacon.com/enterprise-it/5-ways-site-reliability-engineering-transforms-it-ops [Accessed 3 Feb. 2020]

19 Skelton, M, Pais, M. (2019). Team Topologies: organizing business and technology teams for fast flow. IT Revolution Press.

20 Skelton, M, Thatcher, R, Moore, A. (2016). Team Guide to Software Operability. Skelton Thatcher Publications.