Ask the Expert: SRE

  • 13 April 2021

Userlevel 1

Hi C2C! I’ll be hosting an Ask the Experts session around Site Reliability Engineering (SRE) next week:

Ask the Experts: Site Reliability Engineering

I hope to see you there... meanwhile, let’s start the conversation here! Got any burning questions? Anything is fair game.


6 replies

Userlevel 6

I’ll go first: 

  1. SRE skills development for those looking to crack into SRE careers:
    1. Outside of true OJT, how to best simulate the depth and complexity of live production architectures and application environments to learn from as the core skillsets are built?
    2. Many job reqs for SRE roles call for highly talented, multi-faceted generalists with specializations and depth in SWE and SE, from a diverse set of backgrounds. Many hiring managers talk about the difficulty of finding great SRE candidates. Where is the middle ground, and what path(s) can aspiring SREs take to grow into this career?
    3. What is the aspirational career path for the best and brightest SREs of today? 
    4. How does the comp compare to other SWE/SE roles?
    5. Any major differences between SRE roles by Industry? Company size?  Another segmentation type? 
    6. What makes SRE jobs fun and exciting? What pushes folks to leave this career field? 
  2. Acronyms... how to keep up with and make sense of what’s important in the soup?
    1. SOA, SLOs, SLIs, SLAs, MTTD, MTTR, MTF, MTBF... the list goes on. How do you gain exposure to these, truly understand how they fit into the work and the client/customer expectation set, and deliver against them in real time as you manage the stack? Which ones are the most prominent and important to the actual SREs vs. the business vs. the IT team?
  3. How to quickly consume, digest, and be ready to run against new architecture when you are getting started in a new role or with a new client? Any fun or interesting war stories to share around doing this really well... or really poorly? And what was learned in the process?
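
For anyone trying to get a feel for how a few of these acronyms relate, here is a minimal Python sketch, not an official definition. It assumes a single availability SLI measured in minutes, a made-up 30-day window, and two invented incidents; the names `Incident` and `summarize` are hypothetical. It also assumes MTTR is measured from the start of the failure to resolution, which some teams define differently (for example, from detection).

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One hypothetical outage; times are minutes since the start of the window."""
    started: float   # when the failure began
    detected: float  # when monitoring (or a person) noticed it
    resolved: float  # when service was restored


def summarize(incidents, window_minutes, slo):
    """Relate the acronyms for one measurement window.

    slo is the availability target as a fraction, e.g. 0.999 for "three nines".
    Assumes incidents do not overlap and all fall inside the window.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute the means")
    n = len(incidents)
    downtime = sum(i.resolved - i.started for i in incidents)
    uptime = window_minutes - downtime
    budget = (1 - slo) * window_minutes  # downtime the SLO allows
    return {
        "sli_availability": uptime / window_minutes,  # what was measured
        "mttd": sum(i.detected - i.started for i in incidents) / n,
        "mttr": downtime / n,   # mean time from failure to resolution
        "mtbf": uptime / n,     # mean uptime per failure
        "error_budget_minutes": budget,
        "error_budget_left": budget - downtime,
    }


# Hypothetical 30-day window with two outages and a 99.9% SLO.
report = summarize(
    [Incident(1000, 1005, 1020), Incident(20000, 20002, 20010)],
    window_minutes=30 * 24 * 60,
    slo=0.999,
)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the SLI is what you measure, the SLO is the internal target it is compared against, and the error budget is the gap between the two; an SLA would be a contractual promise, usually set looser than the SLO.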

 

A few to get rolling off the cuff. 

JB

Userlevel 6

I also found this article by Gremlin informative and helpful on how to become a top-notch SRE: it brings together notes on the actual role and curates training and resources from all over for folks who’d like to lean in further. @davidstanke I am sure Google has an avalanche of great resources to lean on for the community, considering you guys kicked the door down in this field :)

https://www.gremlin.com/site-reliability-engineering/how-to-become-a-top-notch-sre/

Userlevel 1

Thanks for the questions @JBranham ! I’ll take a stab…

  1. Outside of true OJT, how to best simulate the depth and complexity of live production architectures and application environments to learn from as the core skillsets are built?
    1. That’s a tough one! The instincts and patterns that one learns from supporting real production systems--with real people on the other side depending on them!--are hard, maybe impossible, to gain from tutorials. Apprenticeship is really valuable here, and nobody should expect that someone new to the role of SRE can singly support prod on day one (or ever). Bear in mind, though, wrestling with production outages is only one part of the job of SRE. People looking to break into the role can learn the language, train up on tools, explore automation, and be ready to contribute in myriad ways before they step up to primary on-call.
  2. Many job reqs for SRE roles call for highly talented, multi-faceted generalists with specializations and depth in SWE and SE, from a diverse set of backgrounds. Many hiring managers talk about the difficulty of finding great SRE candidates. Where is the middle ground, and what path(s) can aspiring SREs take to grow into this career?
    1. Did I mention apprenticeship? 🙂 As often happens when a new specialization emerges in our industry, there’s a very small pool of experienced people. (We saw this with cybersecurity, UX, product management, ...) Train people up from within! And bear in mind that domain knowledge is a huge asset: the sysadmins who know all the quirks of those individual machines, and the devs who know where the code-ghosts are buried are in the best position to automate, streamline, and scale--they just need permission, inspiration, and training.
  3. What is the aspirational career path for the best and brightest SREs of today?
    1. Mostly, it’s about continuing to find bigger and harder problems. But there are also some great examples of SREs who have moved from scaling systems into scaling the practice of SRE itself. See: Liz Fong-Jones of Honeycomb, or Alex Hidalgo of Nobl9.
  4. How does the comp compare to other SWE/SE roles?
    1. I’m honestly not sure. I’d expect it to be a little better, since it involves more things? But IDK if that’s true. One thing I know is that at Google we pay people (whether SWEs or SREs) for on-call shifts--if you have to remain ready to respond outside of your regular work hours, you should be compensated for that.
  5. Any major differences between SRE roles by Industry? Company size?  Another segmentation type? 
    1. Small companies (and early-stage products, which are correlated) typically don’t have dedicated SREs, though they’re increasingly adopting SRE language and practices, which is a great way to prepare for when dedicated SREs are eventually necessary. Hardware-focused products are mostly uncharted waters for SRE, though the principles can be applied. Other than that, I’d say it’s mostly just a case of: the more essential it is to a company that their services are reliable, the more likely they are to need the kind of proactive operations that SRE provides. Every day, more and more industries fit that description.
  6. What makes SRE jobs fun and exciting? What pushes folks to leave this career field? 
    1. Honestly, I can’t think of anyone I know who has left the domain--they’ve bounced between internal-facing vs customer-facing, hands-on vs abstract, etc., but it seems to be a role with staying power. What makes it exciting is that it’s always changing. The goal of SRE is to make it so you never have the same incident twice: we learn, adapt, and evolve our systems to be resilient. But that doesn’t mean we automate ourselves out of a job--computers will always devise diabolical new ways to malfunction.
Userlevel 1

That’s a great list of resources! Also check out sre.google and Training Site Reliability Engineers

Userlevel 6

@davidstanke This is fantastic! Thank you for the care and feeding on both this topic and community questions. The sre.google page is a great resource, and re: the Training Site Reliability Engineers pub, perhaps we can host you, Jennifer, JC, and Preston for a community SRE call down the line once we have a group built up in the community around SRE :)

JB 

Userlevel 6

One other thought @davidstanke... it would be great if you could pull Ahmet, Karl, and Sarah on here to tee up similar posts, focused on Ask an Expert: Serverless and Ask an Expert: Cloud AI Platform.

We can drive pre-event discussion and conversation amongst the community here in the leadup to your sessions in April, May, and June and then perhaps we host all four of you for coffee or drinks in July so the community can meet you in person and continue the discussion and questions.  Thoughts? I know @ilias would be game to host some great discussions out of our EMEA footprint as well. 
