How do you quantify user happiness? It’s not easy to measure directly in our systems, but we can look for signals in the user journey. You may experience an outage or other problem that internally seems relatively small, but your users take to Twitter in droves and express their displeasure. Or, you may have a catastrophic event but receive few or no complaints from end users. It is impossible to get inside your users’ heads and see whether they are happy or not while using your service.
To overcome this problem, we use happiness metrics, also known as Service Level Indicators (SLIs). SLIs specify, measure, and track user journey success. They are quantifiable measures of reliability. SLIs tell you whether you are in or out of compliance with your SLO targets, and therefore whether you are in danger of making users unhappy.
Once you choose the services you want to measure, you can then think about the SLIs that you will use to measure users’ common tasks and critical activities. Choosing SLIs that represent the customer’s experience and obtaining accurate SLI measurements are two of the most difficult tasks that organizations undertake on their SRE journeys. The key to selecting meaningful SLIs is to measure reliability from the user’s perspective, not your perspective.
SLI metrics measure an approximation of a customer’s experience using your service. Although availability is an important SLI, it should not be the only SLI you use to measure the reliability of your service. Request latency and error rate are also important metrics indicative of system health. Depending on your service, durability and system throughput should also be considered as metrics. The SLI measures your performance against the SLO. If you continue to miss your SLO, your user experience suffers, and you must take action to bring the SLI back into compliance with the SLO.
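As a rough illustration of how such indicators can be quantified, the sketch below computes an availability SLI and a latency SLI as the fraction of "good" requests out of all requests. The `Request` record, the rule that any 5xx status counts as a failure, and the 300 ms latency threshold are assumptions made for this example, not definitions taken from the chapter.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    """Hypothetical request record; the field names are illustrative."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request, in milliseconds


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> Optional[float]:
    """Fraction of requests served within threshold_ms (assumed threshold)."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


# Made-up sample data: one server error and one slow request out of four.
requests = [
    Request(status=200, latency_ms=120.0),
    Request(status=200, latency_ms=450.0),
    Request(status=503, latency_ms=80.0),
    Request(status=200, latency_ms=95.0),
]

print(availability_sli(requests))  # 0.75
print(latency_sli(requests))       # 0.75
```

Expressing both indicators as a ratio of good events to total events keeps them on the same 0 to 1 scale, which makes them straightforward to compare against a percentage-style target.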
The process for identifying, measuring, and monitoring SLIs may seem daunting, but keep in mind that having an imperfect SLI is better than no SLI. With your SLIs identified, you can set achievable and aspirational SLOs, each a target level of performance for an aspect of your service. When the SLI is at or above the SLO threshold, your customers are generally happy. If it falls below the target, your customers are typically unhappy. As your SLI and SLO practices mature, you can build more sophisticated SLIs that more closely correlate with end-user problems.
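The comparison between an SLI and its SLO target can be sketched in a few lines. This is a minimal example under stated assumptions: the function name `check_slo`, the event counts, and the 99.9% target are hypothetical, and treating an SLI exactly equal to the target as compliant is a choice made here for illustration.

```python
from typing import Tuple


def check_slo(good_events: int, total_events: int, slo_target: float) -> Tuple[float, bool]:
    """Return the measured SLI and whether it meets the SLO target.

    The SLI is the ratio of good events to total events. An SLI at or
    above slo_target is treated as compliant (an assumption of this sketch).
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive to compute an SLI")
    sli = good_events / total_events
    return sli, sli >= slo_target


# Hypothetical measurement windows checked against an assumed 99.9% target.
sli, compliant = check_slo(good_events=99_950, total_events=100_000, slo_target=0.999)
print(f"SLI={sli:.4f} compliant={compliant}")

sli, compliant = check_slo(good_events=99_800, total_events=100_000, slo_target=0.999)
print(f"SLI={sli:.4f} compliant={compliant}")
```

In the first window the SLI is above the target, so users are likely happy. In the second it falls below the target, which is the signal to take corrective action.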
This chapter from the book "SLO Adoption and Usage in Site Reliability Engineering" explains how to select the best SLIs for your services, how to build SLIs, considerations for choosing measurement strategies, and, finally, how to use SLIs to set SLO targets.
Read more at Constructing SLIs to Inform SLOs