Mindware Engineering
What are SLIs, SLOs, and SLAs?
A way to reason about software quality
When we start to develop software, we usually care about its quality.
Software quality is often divided into functional and non-functional.
Functional quality is what software can do for its users.
Non-functional quality is everything else that helps the software serve its functions, and it covers a lot of ground: code quality, maintainability, how easily we can deploy the software, how secure it is, and much more.
When we think about quality, we want measurable characteristics. They give us the ability to make reasonable, data-driven decisions.
Availability
Let’s look at availability. This is the fraction of the time that a service is usable.
For example, for a web service we can compute it as the ratio of successful requests to all requests. By successful requests we mean requests that do not end with a 5xx status code.
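The request-based definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production metric pipeline: the `availability` function and the sample list of status codes are made up for the example.

```python
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that did not end in a 5xx error."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests in the measurement window")
    # Anything outside the 500-599 range counts as successful,
    # following the definition used in the text.
    successful = sum(1 for code in codes if not 500 <= code <= 599)
    return successful / len(codes)


# Hypothetical sample: 8 requests, one of which failed with a 503.
sample = [200, 200, 201, 404, 200, 503, 200, 302]
print(f"availability: {availability(sample):.3f}")  # 7 of 8 requests succeeded
```

Note that a 404 counts as successful here, because only server-side 5xx errors are treated as failures.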
So, what can these numbers give us?
When we design a solution.
When we discuss system design, we can make informed choices about the components of our system.
For example, we can check that Droplets in DigitalOcean have a 99.99% SLO, or that a single instance of Google Compute Engine has a lower bound of 99.5% availability in a month.
Basically, we can check availability for different cloud components in AWS or Google Cloud. We can ask how many users will be affected if we are unavailable for 10 minutes each week.
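To get a feel for these numbers, here is a small sketch that converts between downtime and availability. The function names are hypothetical, and the 30-day window is an assumption chosen for the example; the 99.5% and 99.99% targets and the 10 minutes per week are the figures mentioned above.

```python
MINUTES_PER_WEEK = 7 * 24 * 60       # 10,080 minutes
MINUTES_PER_30_DAYS = 30 * 24 * 60   # 43,200 minutes


def availability_from_downtime(downtime_minutes: float, window_minutes: float) -> float:
    """Time-based availability: the share of the window the service was up."""
    return 1 - downtime_minutes / window_minutes


def allowed_downtime(target: float, window_minutes: float) -> float:
    """Minutes of downtime a given availability target leaves room for."""
    return (1 - target) * window_minutes


weekly = availability_from_downtime(10, MINUTES_PER_WEEK)
print(f"10 min of downtime per week -> {weekly:.4%} availability")

for target in (0.995, 0.9999):
    minutes = allowed_downtime(target, MINUTES_PER_30_DAYS)
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} min of downtime")
```

Under these assumptions, 99.5% over 30 days leaves room for about 216 minutes of downtime, while 99.99% leaves only about 4.3 minutes.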
When we maintain a system.
- We can improve our deployment strategies, because we have a metric to judge them by.
- We can focus on the critical parts of the system, which need higher availability.
- We can change our technical solution to comply with the required availability.
SLI
An SLI is a service level indicator - a measurable metric of some aspect of a service.
The measurements are often aggregated: raw data is collected over a time window and turned into a rate, average, or percentile.
Examples:
- Availability is the fraction of the time that a service is usable.
- Request latency is how long it takes to return a response to a request. The 99th percentile of latency is an example of such an indicator.
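As an illustration of aggregating raw measurements into an indicator, the sketch below computes a percentile with the nearest-rank method and compares it with the average. The nearest-rank method is one of several percentile definitions and is an assumption here, as are the `percentile` helper and the sample latencies.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples in the measurement window")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies (in milliseconds) collected over one time window.
latencies_ms = [120, 95, 110, 480, 105, 99, 130, 101, 97, 1500]

print("average:", sum(latencies_ms) / len(latencies_ms))
print("P50:", percentile(latencies_ms, 50))
print("P99:", percentile(latencies_ms, 99))
```

In this sample the median is 105 ms while the P99 is 1500 ms, which shows why a high percentile is often a more useful indicator than an average: it exposes the slow requests that an average smooths over.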
SLO
An SLO is a service level objective - a target value for a service level that is measured by an SLI. A natural structure for an SLO is one of the following:
- SLI ≤ target
- lower bound ≤ SLI ≤ upper bound
For example:
- 30-day availability ≥ 99.9%
- P99 latency < 1 sec
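Checking an SLI against an SLO then comes down to a bounds comparison. The sketch below is one possible way to model it; the `SLO` class, its fields, and the measured values are hypothetical, and the bounds are treated as inclusive for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """A target range for an SLI; either bound may be left open."""
    name: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_met(self, sli: float) -> bool:
        # Bounds are inclusive: lower <= sli <= upper.
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True


availability_slo = SLO("30-day availability", lower=0.999)
latency_slo = SLO("P99 latency in seconds", upper=1.0)

# Hypothetical measured SLIs for the current window.
print(availability_slo.is_met(0.9995))  # True: above the lower bound
print(latency_slo.is_met(1.4))          # False: slower than the target
```

Keeping the objective as data rather than hard-coding comparisons makes it easier to review and adjust targets as the service evolves.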
SLA
An SLA is a service level agreement - a contract with your users that includes consequences of meeting (or missing) the SLOs.
An easy way to tell the difference between an SLO and an SLA is to ask “What happens if the SLOs aren’t met?”.
The consequences are most easily recognized when they are financial - a rebate or a penalty.
But the consequences can also be internal policies. For instance, a team can freeze the implementation of new features until availability is restored.
Final thoughts
SLIs, SLOs, and SLAs are great tools for working with quality of service.
- Solid SLOs help us design better systems.
- The right SLOs give a team confidence that a service is healthy.
- Choosing appropriate SLOs helps us take the right action if something goes wrong.
Further reading
You can find more information in this article from Google SRE engineers.