DevOps and SRE
Step into the cultural movements of DevOps and SRE to deliver reliable and healthy software solutions more quickly.
- 50 Topics
- 132 Replies
Modernizing software development with delivery blueprints
Shobhit Gupta, Solutions Architect at Google Cloud, will discuss how modernizing software delivery starts with a strong blueprint. Join us to learn how Google Cloud tools and software can improve developer productivity and CI/CD. We will look at how platform admins can build consistent infrastructure across different environments, app devs can streamline development to focus on coding, and security specialists can monitor and apply security policies. Please use the following link to join this session: https://cloudonair.withgoogle.com/events/innovators-software-delivery-bp
Implement Continuous Integration using Google Cloud Services
Continuous Integration in GCP
Continuous Integration forms the CI of the CI/CD process, and at its heart is the culture of submitting smaller units of change frequently. Smaller changes minimize risk, help resolve issues quickly, increase development velocity, and provide frequent feedback. At its core, it is about getting feedback early and often, which makes it possible to identify and correct problems early in the development process. With CI, you integrate your work frequently, often multiple times a day, instead of waiting for one complex integration at the end of the day. Each integration is first verified with an automated build, which enables you to detect integration issues as quickly as possible and reduce problems downstream.
Several elements make up the CI process: making changes to the code, managing the source code, building the artifacts, and storing the artifacts. Google Cloud has an appropriate service for each of the elements
Monitor Cloud Resources using Cloud Monitoring metrics by Prometheus
Cloud Monitoring metrics as Managed Service for Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit. Prometheus collects and stores its metrics as time series data; that is, metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs. Prometheus provides a functional query language called PromQL (Prometheus Query Language), which lets the user select and aggregate time series data in real time.
According to a recent CNCF survey, 86% of the cloud-native community reports using Prometheus for observability. As Prometheus becomes more of a standard, an increasing number of developers are becoming fluent in PromQL. PromQL is a powerful, flexible, and expressive query language, but it's only able to query Prometheus time series data. Other sources of telemetry, such as metrics generated from logs, remain isolated in separate products and might require developers to learn new query tools to access them.
Prometheus
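As a rough illustration of the data model described above (timestamped values stored alongside optional key-value pairs, then selected and aggregated), here is a minimal Python sketch. It is not Prometheus client code; the class names, the `select` and `sum_latest` helpers, and the `http_requests_total` metric with its labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """One point in a time series: a value recorded at a timestamp."""
    timestamp: float  # seconds since the Unix epoch
    value: float


@dataclass
class TimeSeries:
    """A metric name plus optional key-value labels, with its samples."""
    name: str
    labels: dict = field(default_factory=dict)
    samples: list = field(default_factory=list)

    def record(self, timestamp: float, value: float) -> None:
        self.samples.append(Sample(timestamp, value))


def select(series, name, **matchers):
    """Keep series whose name and labels match (loosely like a label selector)."""
    return [
        s for s in series
        if s.name == name
        and all(s.labels.get(key) == val for key, val in matchers.items())
    ]


def sum_latest(series):
    """Aggregate by summing the most recent value of each series."""
    return sum(s.samples[-1].value for s in series if s.samples)


# Hypothetical data: two series of the same metric, split by label.
get_requests = TimeSeries("http_requests_total", {"method": "GET", "job": "api"})
post_requests = TimeSeries("http_requests_total", {"method": "POST", "job": "api"})
get_requests.record(1000.0, 10.0)
get_requests.record(1015.0, 25.0)
post_requests.record(1000.0, 3.0)
post_requests.record(1015.0, 4.0)

all_series = [get_requests, post_requests]
api_total = sum_latest(select(all_series, "http_requests_total", job="api"))
```

In PromQL itself, a comparable selection and aggregation would look something like `sum(http_requests_total{job="api"})`, assuming a metric and label with those names exist.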
Cloud-native approach for implementing DevOps
Cloud Native Apps and DevOps Services
Cloud-native is an approach to building and running applications that exploits the advantages of the cloud computing delivery model. When companies build and operate applications using a cloud-native architecture, they bring new ideas to market faster and respond sooner to customer demands. A cloud-native application is a program that is designed for a cloud computing architecture. These applications are run and hosted in the cloud and are designed to capitalize on the inherent characteristics of a cloud computing software delivery model. A native app, by contrast, is software that is developed for use on a specific platform or device.
Cloud-native applications use a microservice architecture. This architecture efficiently allocates resources to each service that the application uses, making the application flexible and adaptable to a cloud architecture. These microservices that are part of the cloud-native app architecture are packaged in containers that co
SLAs: Agreement for maintaining the Quality
What Do Your SLAs Mean?
A service-level agreement (SLA) is a promise made to a user of a service, indicating that the availability and reliability of the service should meet a certain level of expectation. SLAs act as a pact between the software provider and the software user or client. Depending on the contract, SLAs may also cover responsiveness to incidents and bugs. If an SLA is broken, some penalty may be incurred, such as a refund or a service subscription credit.
SLAs are an integral part of an IT vendor contract. An SLA pulls together information on all of the contracted services and their agreed-upon expected reliability into a single document. SLAs clearly state metrics, responsibilities, and expectations so that, in the event of issues with the service, neither party can plead ignorance. This ensures both sides have the same understanding of requirements. Any significant contract without an associated SLA (reviewed by legal counsel) is open to deliberate or inadvert
Want to know the Metrics to Deliver User Happiness?
How do you quantify user happiness? It’s not easy to measure directly in our systems, but we can look for signals in the user journey. You may experience an outage or other problem that internally seems relatively small, but your users take to Twitter in droves and express their displeasure. Or, you may have a catastrophic event but receive few or no complaints from end users. It is impossible to get inside your users’ heads and see whether they are happy or not while using your service.
To overcome this problem, we use happiness metrics, also known as Service Level Indicators (SLIs). SLIs specify, measure, and track user journey success. They are quantifiable measures of reliability. SLIs tell you whether you are in or out of compliance with your SLO targets and are therefore in danger of making users unhappy.
Once you choose the services you want to measure, you can then think about the SLIs that you will use to measure users’ common tasks and critical activities. Choosing SLIs tha
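To make the idea of an SLI as a quantifiable measure concrete, here is a minimal Python sketch that assumes an SLI expressed as the ratio of good events to total events and compared against an SLO target. The function names, the event counts, and the 99.9% target are illustrative assumptions, not values from the post.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that counted as successful."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """Report whether the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical checkout journey: 9,985 of 10,000 requests succeeded.
checkout_sli = availability_sli(good_events=9_985, total_events=10_000)

# Assumed SLO target of 99.9% success for this journey.
in_compliance = meets_slo(checkout_sli, slo_target=0.999)
```

With these made-up numbers the measured SLI is 99.85%, which falls short of the assumed 99.9% target, so the journey would be out of compliance.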
Artifact Repository visibility through Audit Logging
Increasing visibility into our systems, code, and repositories is essential for effective DevOps processes. Previously, repository audit logging was available for Docker repositories, but Google Cloud has expanded the capability to include Maven, npm, and Python repositories as well! This is a boon for visibility and monitoring! You can read about how to configure the monitoring by following the link included in this post.
How have you implemented monitoring for your repositories? How mature is your organization in providing automated alerting and monitoring? Do you use the audit logging capabilities Google Cloud provides?
https://cloud.google.com/artifact-registry/docs/audit-logging
Guiding Our Way, One Chapter at a Time
The Google Cloud DevOps Research and Assessment (DORA) team is always hard at work. I am looking forward to their 2022 Accelerate State of DevOps Report, coming soon to a blog near you. In the meantime, check out the included link for access to the first chapter of the DevOps Enterprise Guidebook.
https://cloud.google.com/blog/products/devops-sre/devops-enterprise-guidebook-chapter-1
This free guidebook will provide “a concrete action plan for implementing recommendations using Google Cloud’s DORA research to initiate performance improvements.” (from the included article). The first chapter sets the stage for the remaining chapters of the book. It provides information on DORA, the Capabilities Assessment, and some initial steps to take to improve your company’s performance.
I’m looking forward to reading the next chapter!
What was your biggest takeaway from reading chapter 1?
Setting SLOs: How to select and implement them?
Architecture of web-based e-commerce application
SLOs are well-defined, concrete targets for system availability. They represent a dividing line between user happiness and unhappiness, and they frame all discussions about whether the system is running reliably as perceived by users. The SLO Adoption and Usage Survey reveals that, despite the importance of SLOs in ensuring the reliability of services, many organizations have not implemented SLOs. The survey also shows that organizations using SLOs fail to update them regularly as their businesses evolve. These missed opportunities can keep organizations from gaining all of the benefits SRE offers.
This is because defining SLOs is a difficult task and many do not know where to start. In addition, our experience with customers suggests that teams may lack executive support, a critical component in SLO definition, alignment, and success. This post from the Google Cloud SRE & DevOps blog presents a step-by-step guide for building SLOs and de
First recipient of the Google Cloud DevOps Award is ...
Google Cloud has created a new award to recognize excellence in DevOps implementations, and the inaugural awardee is Broadcom.
Reading through the linked article, it makes sense why Broadcom was recognized! With an average of 6,500+ deployments across multiple environments in a day, and by consolidating from 60 data centers globally to 27 Google Cloud Regions and points of presence, Broadcom has reduced their teams’ maintenance burden and enabled a new focus on innovation.
It’s always fun, to me, to see companies getting DevOps right. There are real benefits that can be realized with good practices!
https://symantec-enterprise-blogs.security.com/blogs/feature-stories/broadcom-software-honored-first-ever-google-cloud-devops-award
SLOs: The Magic Behind SRE
What are SLOs?
Service level objectives (SLOs) are part of a set of measures to increase the quality of service management. This set of measures helps teams understand which behaviors really matter, how to measure them, and how to evaluate them so that the service has an acceptable quality. SLOs are key to making data-driven decisions about reliability. They are at the core of SRE practices. This gives the Site Reliability Engineering (SRE) team greater confidence about what is important for the full functioning of the service, and results in a positive end-user experience.
Why Do SREs Need SLOs?
Site Reliability Engineers cannot manage their services correctly if they have not identified the behaviors most important to the service and to customers. Carefully considered SLOs prioritize the work of SRE teams by providing data points that allow leaders to consider the opportunity cost of performing reliability work versus investing in functionality that will gain or retain cus
Managed Service for Prometheus just got even more affordable
In the linked blog post, posted just today, Google Cloud has announced that they are modifying the pricing structure for Managed Service for Prometheus. The modifications include the addition of a new tier for those with massive utilization of the service, along with a reduction in pricing for every tier available.
https://cloud.google.com/blog/products/devops-sre/managed-service-for-prometheus-offers-new-pricing-tier
Monitoring and providing clear visibility into your systems is an essential element of DevOps and SRE workloads. Prometheus is a powerful tool to provide that, but can get challenging to maintain in a performant, resilient, consistent manner across widely distributed workloads. Google Cloud’s Managed Service for Prometheus takes the burden of managing and scaling your Prometheus datastore off of your teams so they can focus on using the data instead of on collecting and storing it. Are you using Managed Service for Prometheus? How much of a savings will your team realize from
How SRE Relates to DevOps?
Shared Benefits of DevOps and SRE
With the growing complexity of application development, organizations are increasingly adopting methodologies that enable reliable, scalable software. For over a decade, two similar concepts, DevOps and SRE, have coexisted in the world of software development. They may look like competitors. However, a closer look reveals that the apparent rivals are complementary pieces of a puzzle that fit nicely together. DevOps is a broad philosophy and culture, and because it effects wider change than SRE does, it is more context-sensitive. DevOps is relatively silent on how to run operations at a detailed level. Site Reliability Engineering (SRE), on the other hand, has relatively narrowly defined responsibilities. Its remit is generally service-oriented and end-user-oriented rather than whole-business-oriented.
This chapter from the book "The Site Reliability Workbook" explains how DevOps and SRE facilitate building reliable software, where they overlap
Choose wisely, my SRE friends, when establishing SLOs!
The following article was posted on the Google Cloud DevOps-SRE blog and provides advice for SRE teams around choosing their SLOs appropriately. I especially appreciate two components of this blog post. First, the emphasis on visibility into your system when assessing appropriate SLOs, and second, the use of visibility and risk analysis in the process.
Have you been responsible for identifying SLOs for your systems? Did you use a systemic analysis of risk as recommended by this article? Did you consider the impact on visibility into system performance if the services you use for monitoring and alerting fail?
https://cloud.google.com/blog/products/devops-sre/how-sres-analyze-risks-to-evaluate-slos
DevOps 2021 Report from the DORA team
The State of DevOps 2021 report (from Google Cloud’s DevOps Research and Assessment, or DORA, team) is available at the following link: https://cloud.google.com/devops and has some really interesting takeaways.
One of my favorites? The recognition that SRE and DevOps are complementary cultures! For a while I was seeing a lot of posts on LinkedIn and elsewhere asking whether you thought DevOps or SRE was a stronger methodology, and I always thought the question was odd. In my mind they have always been complementary. Both are highly responsive, focused on keeping the customer happy by meeting their needs rapidly, incorporating as much automation as possible, and relying upon agile teams…
What’s your favorite takeaway from the report? Share here in a comment!
Get started with SRE - Webinar Experience
I know cloud is the future. I have recently started getting certified in cloud and have earned some Azure fundamentals certifications. I plan to get certified in GCP too, but was waiting for an opportunity like this one :) Recently I came to know that C2C is a big GCP community. The event was very interesting. Keeping your business websites up and online is hard work, and learning how SRE teams do this 24×7 is indeed interesting.
I had already heard about SRE online, but I wasn't sure what an SRE really does in the daily job. This event helped me learn more about SRE.
Some interesting points:
- 2500 SREs are behind Google systems.
- Have multiple versions of your apps for availability.
In the event I asked Alexis a question about what would happen if we run out of error budget.
Finally, I didn't network with other attendees, but I am free to connect and discuss. :)
Thanks for the great webinar.
Getting Started with GIT - Liveproject from Manning Publications Free
Manning Publications is releasing one of their liveProjects for free. This is a great resource for those who are starting to use Git for their projects.
Looking to begin your adventures in Git? Getting Started with Git is a liveProject by Rafiullah Hamedy, which will help you master the basics of Git by completing the kind of day-to-day tasks done by professional programmers.
Get it here: http://mng.bz/l97B
Calling all Cloud Natives..this is an amazing interactive resource from the Cloud Native Computing Foundation
https://landscape.cncf.io/
This is an amazing, interactive, and simple tool to better understand the Cloud Native Landscape. Kudos to the team over at the Cloud Native Computing Foundation for creating this. Take a quick peek at the features and functionality, and then bookmark it, as it is open source and updated by the community all of the time. My two favorite features: the filters on the left to change the landscape views, and the data and information served up when you click on each company’s tile in the trail map. JB
Google offering up free SRE classroom training!
If you are interested in getting your hands dirty as you learn more about modern SRE principles, Google recently posted a great series of content, tutorials, and hands-on exercises for SRE classroom training. Here is the abstract: SRE Classroom is a collection of workshops developed by Google's Site Reliability Engineering group. The goals of this workshop are to (1) introduce participants to the principles of non-abstract large systems design (NALSD), and (2) provide hands-on experiences with applying these principles to the design and evaluation of these systems. We consider NALSD a concept fundamental to SRE, and understanding its principles provides a basis for having meaningful conversations about the design and operation of large software systems. Here is the actual post: https://sre.google/classroom/
Google Cloud Platform Tutorial: From Zero to Hero with GCP
This article from YourDevOpsGuy aims “to take you from zero to hero in Google Cloud Platform (GCP) in just one article.”
It covers:
- Get started with a GCP account for free
- Reduce costs in your GCP infrastructure
- Organize your resources
- Automate the creation and configuration of your resources
- Manage operations: logging, monitoring, tracing, and so on.
- Store your data
- Deploy your applications and services
- Create networks in GCP and connect them with your on-premise networks
- Work with Big Data, AI, and Machine Learning
- Secure your resources
Check it out and upvote it if you think we should feature it in our Learn resource collection.