

Getting Started with Site Reliability Engineering (SRE) - Key Takeaways

By Yasin Quareshy | April 26, 2022

On March 17, 2022, the C2C Connect: UK and Ireland group, led by Charlotte Moore (@charlotte.moore), Andy Yates (@andy.yates), Fintan Murphy (@fintan.murphy), Paul Lees (@paul.less), Sathy Sannasi (sathyaram_s.), and Yasin Quareshy (YasinQuareshy), invited Google Cloud Developer Advocate Alexis Moussine-Pouchkine to join them for an hour-long session on Site Reliability Engineering (SRE). The group's monthly sessions bring together a local community of cloud experts and customers to connect, learn, and shape the future of cloud.

60 Minutes Summed Up in 60 Seconds

  1. Moussine-Pouchkine started the session by citing a number of publications and books on SRE, and then introduced its focus: the Service Management aspect of SRE and how it is applied at Google.

  2. Next, Moussine-Pouchkine introduced DevOps Research and Assessment (DORA), which helps an organization measure how its software delivery compares with that of the best-performing organizations, and how close it is to becoming an elite performer.

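  3. Moussine-Pouchkine shared the key metrics DORA uses to measure a team's software delivery performance (deployment frequency, lead time for changes, change failure rate, and time to restore service) and explained how to set up an environment with Four Keys, an open-source project available on GitHub, to measure these metrics for your own workloads.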

  4. To demonstrate a practical implementation, Moussine-Pouchkine introduced the Pic-A-Daily app as an SRE use case. Pic-A-Daily is an event-driven microservices app built from several components; it recognizes the contents of a photo and tags the image with searchable categories.

  5. Next, Moussine-Pouchkine gave his definition of SRE, referring to the billions of people who use Google's services and the 2,500 SREs responsible for the reliability of those services. He also discussed how to balance reliability with agility.

  6. Moussine-Pouchkine discussed tools, infrastructure observability, and culture in detail, and described the key measures used to gauge a service's impact on its customers:

    • Service Level Indicator (SLI): a measurement of an aspect of the service that customers experience, e.g. availability or latency.

    • Service Level Objective (SLO): the target level of service promised for an SLI, e.g. 99.9% of requests succeeding. The room for failure that the SLO leaves (100% minus the target) is the error budget.

    • Service Level Agreement (SLA): a business-driven commitment to customers, not a metric that SREs work to directly.

  7. Moussine-Pouchkine also discussed recommended SRE best practices:

    1. Versioning your software, so that multiple versions are deployed and ready to serve requests if needed.

    2. Canary and blue/green deployments, which provide flexibility and confidence when rolling back a release (if required) and support A/B testing of your software.

    3. Using Google Cloud tools that help diagnose and remediate faults from a single, centralized view, rather than checking multiple locations to identify issues.

  8. The climax of the session was a demo of the Pic-A-Daily app, showing how the tooling and SLO metrics can be used to identify and diagnose a fault. Tools that support SREs include monitoring, error reporting, debugging, logging, tracing, and profiling.

  9. The session closed with a Q&A and some available resources on the topic.
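DORA's four key metrics are deployment frequency, lead time for changes, change failure rate, and time to restore service. As a rough illustration of the arithmetic behind them (not how the Four Keys project itself is implemented), the Python sketch below computes all four from a hypothetical in-memory list of deployment records. The `Deployment` class, its fields, and the `dora_metrics` function are assumptions made for this example, and measuring restore time from the failing deploy rather than from the start of the incident is a deliberate simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    deployed_at: datetime
    commit_times: List[datetime]            # commit time of each change shipped in this deploy
    caused_failure: bool = False            # did this deploy degrade the service?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deploys: List[Deployment], window_days: int) -> dict:
    """Compute the four DORA key metrics over a window of `window_days` days."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    # Deployment frequency: average deployments per day in the window.
    per_day = len(deploys) / window_days

    # Lead time for changes: median commit-to-production time over all changes.
    lead_times = [hours(d.deployed_at - c) for d in deploys for c in d.commit_times]

    # Change failure rate: share of deployments that caused a failure.
    failed = [d for d in deploys if d.caused_failure]
    failure_rate = len(failed) / len(deploys)

    # Time to restore service: median hours from the failing deploy to recovery
    # (simplification: real tooling measures from when the incident starts).
    restore = [hours(d.restored_at - d.deployed_at) for d in failed if d.restored_at]

    return {
        "deploys_per_day": per_day,
        "median_lead_time_hours": median(lead_times) if lead_times else None,
        "change_failure_rate": failure_rate,
        "median_time_to_restore_hours": median(restore) if restore else None,
    }


# Example: two deploys in a one-week window, one of which caused an incident.
t0 = datetime(2022, 3, 1, 12, 0)
deploys = [
    Deployment(t0, [t0 - timedelta(hours=4)]),
    Deployment(t0 + timedelta(days=1), [t0 + timedelta(hours=20)],
               caused_failure=True, restored_at=t0 + timedelta(days=1, hours=2)),
]
metrics = dora_metrics(deploys, window_days=7)
print(metrics)
```

Using medians rather than means for the two time-based metrics keeps a single unusually slow change or incident from dominating the result.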
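The relationship between an SLI, an SLO, and the error budget is easiest to see with numbers. The Python sketch below assumes a request-based availability SLI (successful requests divided by total requests) and a 99.9% SLO over some fixed window; the function names, the traffic figures, and the simple "stop releasing when the budget is spent" rule are illustrative assumptions, not a prescribed Google policy.

```python
def availability_sli(good: int, total: int) -> float:
    """Request-based SLI: fraction of requests served successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def error_budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget left: 1.0 = untouched, <= 0 = SLO missed."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1 (exclusive)")
    allowed_failures = (1 - slo) * total  # the error budget, in failed requests
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures


# 1,000,000 requests in the window, 600 of which failed, against a 99.9% SLO:
# the budget allows about 1,000 failures, so roughly 40% of it is left.
sli = availability_sli(good=999_400, total=1_000_000)
remaining = error_budget_remaining(good=999_400, total=1_000_000, slo=0.999)

# Hypothetical policy: keep shipping features while budget remains, otherwise
# prioritize reliability work. This is one way a budget balances agility and reliability.
can_release = remaining > 0
print(f"SLI={sli:.4f}, budget remaining={remaining:.0%}, can_release={can_release}")
```

Because the error budget is derived from the SLO, tightening the SLO from 99.9% to 99.99% shrinks the budget tenfold, from about 1,000 to about 100 allowed failures per million requests.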
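Canary releases route a small share of traffic to the new version and compare it against the stable version before rolling out further, which is what makes rolling back a low-risk decision. The Python sketch below shows one deliberately simple promote/rollback rule, assuming per-version error counts are already being collected; the `canary_verdict` function, its thresholds, and the example numbers are hypothetical. Production canary analysis usually compares several metrics with statistical tests rather than a single error-rate ratio.

```python
def canary_verdict(baseline_errors: int, baseline_requests: int,
                   canary_errors: int, canary_requests: int,
                   max_ratio: float = 1.5, error_floor: float = 0.001,
                   min_requests: int = 500) -> str:
    """Return 'wait', 'rollback', or 'promote' for a canary release."""
    if baseline_requests <= 0:
        raise ValueError("baseline must have served requests")
    if canary_requests < min_requests:
        return "wait"  # too little canary traffic to judge yet

    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # The floor stops a near-zero baseline from turning a single canary error
    # into an automatic rollback.
    allowed_rate = max(baseline_rate, error_floor) * max_ratio
    return "rollback" if canary_rate > allowed_rate else "promote"


# Stable version: 50 errors in 100,000 requests (0.05%), so the allowed
# canary error rate is max(0.05%, 0.1%) * 1.5 = 0.15%.
print(canary_verdict(50, 100_000, canary_errors=9, canary_requests=1_000))  # rollback: 0.9% errors
print(canary_verdict(50, 100_000, canary_errors=1, canary_requests=1_000))  # promote: 0.1% errors
print(canary_verdict(50, 100_000, canary_errors=1, canary_requests=100))    # wait: only 100 requests
```

Blue/green deployments apply the same idea at a coarser grain: traffic switches entirely between two environments, so rolling back means switching traffic back to the previous one.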

 

Watch the full recording of this event below:

Despite its 60-minute time limit, this conversation didn't stop. What are your thoughts on SRE, Service Management, DORA, or any of the other topics discussed above? Reply in the comments below or start a new topic on our group page.

Be sure to sign up for C2C and join our C2C Connect: UK and Ireland group to connect with Google Cloud customers and experts based in the UK, Ireland, and beyond.
