I was once asked what, in my opinion, reliability means for a distributed system. My answer was a hodgepodge of terms, explaining that reliability is a system’s ability to avoid failure, or to withstand failures to some extent, and continue to serve traffic. As I gain more experience working with distributed systems, I’ve come to realize that reliability is a conversation. It is a conversation operators have with the systems, the underlying infrastructure, and the various services as we operate them and as they serve customers. It is also a conversation that SREs and product/development teams have to make sure we build a product that meets our requirements.
SLIs and SLOs help quantify reliability and communicate it to others. Together with SLAs, the terms are:
- SLI - Service Level Indicator
- SLO - Service Level Objective
- SLA - Service Level Agreement
SLI
An SLI is a metric that tells us what is being measured for a service. Examples include:
- Response time - the time between sending a request and getting a response
- Throughput - the number of requests the system handles per unit of time
- Error rate - the ratio of failed requests to total requests
- Availability - the fraction of time a service is usable
An ideal SLI reflects the end user’s perspective.
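As a rough illustration, the SLIs listed above can be computed from a log of request records. This is a minimal sketch, not a production implementation: the `Request` record, the function names, and the sample data are all hypothetical, the percentile uses the simple nearest-rank method, and error rate is assumed to mean failed requests over all requests.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request as seen by the caller."""
    latency_ms: float  # time between sending the request and getting the response
    ok: bool           # True if the request succeeded


def error_rate(requests):
    """Failed requests divided by all requests (assumed definition)."""
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of response time, in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput(requests, window_seconds):
    """Requests handled per second over the measurement window."""
    return len(requests) / window_seconds


def availability(up_seconds, total_seconds):
    """Fraction of the period during which the service was usable."""
    return up_seconds / total_seconds


# Made-up sample: 10 requests observed over a 5-second window.
sample = [Request(latency_ms=ms, ok=ok) for ms, ok in [
    (120, True), (95, True), (310, False), (101, True), (88, True),
    (140, True), (99, True), (450, False), (110, True), (105, True),
]]

print(error_rate(sample))              # 0.2
print(latency_percentile(sample, 90))  # 310
print(throughput(sample, 5))           # 2.0
print(availability(3590, 3600))        # roughly 0.997
```

Measuring these values from the caller's side (for example at a load balancer or client) rather than inside the service is one way to keep the SLI close to what the end user actually experiences.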
SLO
An SLO is a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. To simplify further, SLO = SLI + Thresholds. For example,
- Setting the availability SLO to 99.9% means the service can be down for no more than about 43 minutes in a 30-day month.
- Setting the error rate SLO to <1% means that, averaged over a period of time, the error rate for the service will be less than 1%.
The SLO is the “proper level of reliability” targeted by the service.
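The "SLI plus thresholds" structure can be sketched as a simple bounds check, and the downtime an availability target leaves room for is plain arithmetic. The function names here are hypothetical, and the 30-day period is an assumption; a real SLO would name its own measurement window.

```python
def meets_slo(sli, lower=None, upper=None):
    """True when lower <= sli <= upper; a bound left as None is not checked."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


def allowed_downtime_minutes(availability_target, period_days=30):
    """Downtime, in minutes, that an availability target permits over the period."""
    return (1 - availability_target) * period_days * 24 * 60


print(meets_slo(0.9995, lower=0.999))             # True: availability above target
print(meets_slo(0.012, upper=0.01))               # False: error rate above 1%
print(round(allowed_downtime_minutes(0.999), 1))  # 43.2
```

Each extra "nine" cuts the permitted downtime by a factor of ten, which is why the threshold should be chosen deliberately rather than defaulting to the highest number available.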
SLA
An SLA defines what happens if we don’t stay within the thresholds of the SLOs. To simplify, SLA = SLO + Consequences. The consequences of missing an SLA are most often financial, such as a rebate or a penalty, but they can take other forms as well.
For example, AWS offers this SLA for the EC2 service at the instance level: a monthly uptime percentage less than 99.5% but equal to or greater than 99.0% results in a 10% Service Credit Percentage. A more detailed breakdown can be found in AWS’s published SLA for EC2.
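The "SLO + Consequences" idea can be sketched as a lookup from measured uptime to a credit tier. This is an illustrative sketch only: the function name and tier format are hypothetical, the table contains just the single tier quoted above, and the 99.5% commitment is inferred from that tier's upper bound rather than taken from the full agreement.

```python
def service_credit_pct(uptime_pct, commitment_pct, tiers):
    """Return the credit percentage owed for a month's uptime.

    tiers is a list of (lower_inclusive, upper_exclusive, credit_pct).
    Returns 0 when the commitment is met, and None when the uptime falls
    below the commitment but outside every tier supplied.
    """
    if uptime_pct >= commitment_pct:
        return 0
    for lower, upper, credit in tiers:
        if lower <= uptime_pct < upper:
            return credit
    return None


# Only the tier quoted in the text; tiers below 99.0% are deliberately left out.
QUOTED_TIERS = [(99.0, 99.5, 10)]

print(service_credit_pct(99.7, 99.5, QUOTED_TIERS))  # 0
print(service_credit_pct(99.2, 99.5, QUOTED_TIERS))  # 10
print(service_credit_pct(98.0, 99.5, QUOTED_TIERS))  # None (not covered by the quoted tier)
```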
To summarize,
- An SLI tells us what we measure.
- An SLO tells us what our goal is.
- An SLA is a promise made to the clients/users.
Error Budget
The error budget is the amount of unreliability the SLO permits over a period of time: 1 minus the SLO target. A 99.9% availability SLO, for example, leaves an error budget of 0.1%. Comparing the SLI’s actual performance against that budget shows how much of it has been spent and serves as a signal of when you need to take corrective action. When a service exhausts its error budget, operators can pause/freeze further deployments until the persistent causes of error in the system are eliminated.
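A request-based error budget can be sketched in a few lines. The function name, the report fields, and the request counts below are hypothetical, and the freeze rule (freeze once failures exceed the budget) is one possible policy, not the only one.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Compare observed failures with the failures the SLO allows."""
    allowed = (1 - slo_target) * total_requests
    if allowed > 0:
        consumed_fraction = failed_requests / allowed
    else:
        # A 100% target leaves no budget: any failure overspends it.
        consumed_fraction = float("inf") if failed_requests else 0.0
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "consumed_fraction": consumed_fraction,
        "freeze_deployments": failed_requests > allowed,
    }


# Made-up month: 1,000,000 requests against a 99.9% success SLO.
healthy = error_budget_report(0.999, 1_000_000, 400)
print(round(healthy["allowed_failures"]))       # 1000
print(round(healthy["consumed_fraction"], 2))   # 0.4
print(healthy["freeze_deployments"])            # False

overspent = error_budget_report(0.999, 1_000_000, 1_200)
print(overspent["freeze_deployments"])          # True
```

Tracking the consumed fraction over time, rather than only the final yes/no, gives teams an early warning before the budget is fully spent.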
Reliability is not just a measure; it’s a dialogue between systems, operators, and users. By quantifying reliability through SLIs and setting clear goals with SLOs, we can deliver on the promises made in SLAs. The error budget then ensures that we maintain a balance between innovation and stability. How do you approach reliability in your systems? Let’s discuss!