Learn | C2C Community

Getting Started with Site Reliability Engineering (SRE) - Key Takeaways

On March 17, 2022, the C2C Connect: UK and Ireland group, led by Charlotte Moore (@charlotte.moore), Andy Yates (@andy.yates), Fintan Murphy (@fintan.murphy), Paul Lees (@paul.less), Sathy Sannasi (sathyaram_s.), and Yasin Quareshy (YasinQuareshy), invited Google Cloud Developer Advocate Alexis Moussine-Pouchkine to join them for an hour-long session on Site Reliability Engineering (SRE). The group’s monthly sessions bring together a local community of cloud experts and customers to connect, learn, and shape the future of cloud.

60 Minutes Summed Up in 60 Seconds

Moussine-Pouchkine started the session by citing a number of publications and books on SRE, and then introduced the focus of the session: the Service Management aspect of SRE, and how it is applied at Google.

Next, he introduced DevOps Research and Assessment (DORA), which helps measure how an organization’s service delivery compares to that of the best organizations, and how close the organization is to becoming an elite performer. He shared the key metrics DORA uses to measure a team’s software delivery performance and explained how to set up an environment using Four Keys (available on GitHub) to collect those measurements.

To demonstrate a practical implementation, he introduced the Pic-A-Daily app as an SRE use case. Pic-A-Daily is an event-driven microservices app, built from several delivery components, that recognizes photos and tags each image with a searchable category.

Next, Moussine-Pouchkine gave his definition of SRE, making reference to the billions of users of Google’s services and the 2,500 SREs responsible for the reliability of these services. He also discussed balancing reliability with agility.

He discussed tools, infrastructure observability, and culture in detail, citing the following key metrics used to measure impacts on a customer:

Service Level Indicator (SLI), which captures metrics that impact a customer, e.g. availability or latency.
Service Level Objective (SLO), or the quality of service promised, expressed as a target for an SLI; the gap between that target and 100% is the error budget.
Service Level Agreement (SLA), a business-driven metric not used by SREs.

Moussine-Pouchkine also discussed some recommended SRE best practices to follow:

Versioning your software.
Having multiple versions of software deployed and ready to serve requests if needed.
Using canary and blue/green deployments to provide flexibility and confidence in rolling back releases (if required), and A/B testing your software.
Using the Google Cloud tools discussed in the session to help diagnose and remediate faults.
Having a centralized view of things rather than using multiple locations to identify issues.

The climax of the session was a demo of the Pic-A-Daily app showing how the tooling and SLO metrics can be used to identify and diagnose a fault. Tools that support SREs include monitoring, error reporting, debugger, logging, traces, and profiler. The session closed with a Q&A and some available resources on the topic.

Watch the full recording of this event below:

Despite its 60-minute time limit, this conversation didn’t stop. What are your thoughts on SRE, Service Management, DORA, or any of the other topics discussed above? Reply in the comments below or start a new topic on our group page.

Be sure to sign up for C2C and join our C2C Connect: UK and Ireland group to connect with Google Cloud customers and experts based in the UK and Ireland and beyond.

Extra Credit

SRE Resources
DORA at C2C
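The relationship between an SLI, an SLO, and the error budget mentioned in the session can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the session: the `error_budget` function, the `SloReport` type, and the request counts are all hypothetical, and availability is assumed to be measured as the fraction of successful requests.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    slo: float               # target fraction of good requests
    budget_total: float      # failed requests the SLO tolerates in this window
    budget_remaining: float  # tolerated failures left (negative if overspent)


def error_budget(good: int, total: int, slo: float) -> SloReport:
    """Compare a measured availability SLI against an SLO target."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1 - slo) * total   # the error budget for this window
    actual_bad = total - good
    return SloReport(sli, slo, allowed_bad, allowed_bad - actual_bad)


# Hypothetical window: one million requests, 500 failures, 99.9% target.
report = error_budget(good=999_500, total=1_000_000, slo=0.999)
print(f"SLI={report.sli:.4%}, budget left={report.budget_remaining:.0f} requests")
```

A team could use a report like this to decide whether it still has room for risky releases or should slow down until the budget recovers.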

Categories:DevOps and SRESession Recording

Monitoring and observability Drive Conversation at C2C Connect: France Session on January 11

On January 11, 2022, C2C members @antoine.castex and @guillaume blaquiere hosted a powerful session for cloud practitioners in France and beyond. C2C Connect: France sessions intend to bring together a community of cloud experts and customers to connect, learn, and shape the future of cloud.

60 Minutes Summed Up in 60 Seconds

Yuri Grinshteyn, Customer SRE at Google Cloud, was the guest of the session. Also known as “Stack Doctor” on YouTube, Grinshteyn advocates the best ways to monitor, observe, and follow the SRE best practices Google has learned in its own service SRE teams.

Grinshteyn explained the difference between monitoring and observability: monitoring is “only” the data about a service or a resource; observability is the behavior of the service metrics through time.

To observe data, you need different data sources: metrics, of course, but also logs and traces.

There are several tools available, and the purpose of each is observability: Fluentd, OpenCensus, Prometheus, Grafana, etc. All are open-source, portable, and compatible with Cloud Operations.

The overhead of instrumented code is negligible, and the metrics it provides are worth far more than the few CPU cycles lost because of it.

Microservices and monoliths should both use trace instrumentation. Even a monolith never works alone: it uses Google Cloud services, APIs, databases, etc. Trace allows us to understand North-South and East-West traffic.

Get in on the Monitoring and Observability Conversation!

Despite its 30-minute time limit, this conversation didn’t stop. Monitoring and observability is a hot topic, and it certainly kept everyone’s attention. The group spent time on monitoring, logging, error budget, SRE, and other topics such as:

Cloud Operations
Managed Service for Prometheus
Cloud Monitoring

Members also shared likes and dislikes.
For example, one guest, Mehdi, “found it unfortunate not to have out of the box metrics on GKE to monitor golden signals,” and said “it’s difficult to convince ops to install Istio just for observability.”

Preview What’s Next

Two upcoming sessions will cover topics that came up but didn’t make it to the discussion floor. If either of these events interests you, be sure to sign up to get in touch with the group!

Extra Credit

Looking for more Google Cloud products news and resources? We got you. The following links were shared with attendees and are now available to you!

Video of the session
Cloud Monitoring
Managed Service for Prometheus
sre.google website
SRE books
Stack Doctor YouTube playlist
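To make the “golden signals” mentioned in the discussion a little more concrete, here is a minimal sketch that derives three of them (traffic, errors, latency) from a batch of request records. Everything in it is an assumption for illustration: the `Request` record, the `golden_signals` function, and the choice to count HTTP 5xx responses as errors are hypothetical, and the fourth signal, saturation, is omitted because it needs resource data that request logs do not carry.

```python
import math
from typing import NamedTuple, Sequence


class Request(NamedTuple):
    latency_ms: float
    status: int  # HTTP status code


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: Sequence[Request], window_s: float) -> dict:
    """Summarize traffic, error rate, and tail latency for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
    }


# Hypothetical two-second window with four requests.
sample = [Request(10, 200), Request(20, 200), Request(30, 500), Request(40, 200)]
print(golden_signals(sample, window_s=2))
```

In practice these values would come from a monitoring backend rather than an in-memory list; the sketch only shows what each signal measures.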

Categories:Data AnalyticsDevOps and SRECloud Operations

How to Choose a Virtual Machine (VM) From Google Cloud

Google Cloud provides virtual machines (VMs) to suit any workload, be it low cost, memory-intensive, or data-intensive, and any operating system, including multiple flavors of Linux and Windows Server. You can even couple two or more of these VMs for fast and consistent performance. VMs are also cost-efficient: pricier VMs come with discounts of up to 30% for sustained use and discounts of over 60% for three-year commitments.

Google’s VMs can be grouped into five categories.

Scale-out workloads (T2D)

If you’re managing or supporting a scale-out workload (for example, web servers, containerized microservices, media transcoding, or large-scale Java applications), you’ll find Google’s T2D ideal for your purposes. It’s cheaper and more powerful than general-purpose VMs from leading cloud vendors. It also comes with full x86 CPU compatibility, so you don’t need to port your applications to a new processor architecture. T2D VMs have up to 60 vCPUs, 4 GB of memory per vCPU, and up to 32 Gbps networking.

Couple T2D VMs with Google Kubernetes Engine (GKE) for optimized price performance for your containerized workloads.

General purpose workloads (E2, N2, N2D, N1)

Looking for a VM for general computing scenarios such as databases, development and testing environments, web applications, and mobile gaming? Google’s E2, N2, N2D, and N1 machines offer a balance of price and performance. The largest machine types in this group support up to 224 vCPUs and 896 GB of memory.

Differences between VM types

E2 VMs specialize in small to medium databases, microservices, virtual desktops, and development environments. E2s are also the cheapest general-purpose VM.
N2, N2D, and N1 VMs are better equipped for medium to large databases, media streaming, and cache computing.

Limitations

E2 VMs don’t come with discounts for sustained use.
E2 VMs don’t support GPUs, local solid-state drives (SSDs), sole-tenant nodes, or nested virtualization.

Ultra-high memory VMs (M2, M1)

Memory-optimized VMs are ideal for memory-intensive workloads, offering more memory per core than other VM types, with up to 12 TB of memory. They suit applications with higher memory demands, such as in-memory data analytics workloads or large in-memory databases such as SAP HANA. Both models are also a good fit for business warehousing (BW) workloads, genomics analysis, and SQL analysis services.

Differences between VM types

M1 works best with medium in-memory databases, such as Microsoft SQL Server.
M2 works best with large in-memory databases.

Limitations

These memory-optimized VMs are only available in specific regions and zones on certain CPU processors.
You can’t use regional persistent disks with memory-optimized machine types.
Memory-optimized VMs don’t support graphics processing units (GPUs).
M2 VMs don’t come with the same 60-91% discount as Google’s preemptible VMs (PVMs). (These PVMs last no longer than 24 hours, can be stopped abruptly, and may sometimes not be available at all.)

Compute-intensive workloads (C2)

When you’re into high-performance computing (HPC) and want maximum scale and speed, such as for gaming, ad serving, media transcoding, AI/ML workloads, or analyzing extensive Big Data, you’ll want Google Cloud’s flexible and scalable compute-optimized virtual machines (C2 VMs). This VM offers up to 3.8 GHz sustained all-core turbo clock speed, which is the highest consistent performance per core for real-time performance.

Limitations

You can’t use regional persistent disks with C2s.
C2s have different disk limits than general-purpose and memory-optimized VMs.
C2s are only available in select zones and regions on specific CPU processors.
C2s don’t support GPUs.

Demanding applications and workloads (A2)

The accelerator-optimized (A2) VMs are designed for your most demanding workloads, such as machine learning and high-performance computing.
They’re the best option for workloads that require GPUs and are well suited to solving large problems in science, engineering, or business. A2 VMs range from 12 to 96 vCPUs, offering you up to 1360 GB of memory. Each A2 machine type has a fixed number of GPUs attached. You can add up to 257 TB of local storage for applications that need higher storage performance.

Limitations

You can’t use regional persistent disks with A2 VMs.
A2s are only available in certain regions and zones.
A2s are only available on the Cascade Lake platform.

So: which VM should I choose for my project?

Any of the above VMs could be the right choice for you. To determine which would best suit your needs, take the following considerations into account:

Your workload: What are your CPU, memory, porting, and networking needs? Can you be flexible, or do you need a VM that fits your architecture? For example, if you use Intel AVX-512 and need to run on CPUs that have this capability, are you limited to VMs that fit this hardware, or can you be more flexible?
Price/performance: Is your workload memory-intensive, and do you need high-performance computing for maximum scale and speed? Does your business or project deal with data-intensive workloads? In each of these cases, you’ll have to pay more. Otherwise, go for the cheaper “general purpose” VMs.
Deployment planning: What are your quota and capacity requirements? Where are you located? (Remember: some VMs are unavailable in certain regions.)

Do you work with VMs? Are you looking for a VM to support a current project? Which VMs do you use, or which would you consider? Drop us a line and let us know!
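The selection questions above can be sketched as a small lookup helper. This is only an illustration of the article’s five groupings, not an official sizing tool: the `VM_FAMILIES` table, the `suggest_families` function, and its boolean parameters are hypothetical, and real choices also depend on region availability, quotas, and pricing.

```python
# Machine families grouped as in the article above (assumed, simplified).
VM_FAMILIES = {
    "scale-out": ["T2D"],
    "general-purpose": ["E2", "N2", "N2D", "N1"],
    "memory-optimized": ["M2", "M1"],
    "compute-optimized": ["C2"],
    "accelerator-optimized": ["A2"],
}


def suggest_families(
    needs_gpu: bool = False,
    memory_intensive: bool = False,
    compute_intensive: bool = False,
    scale_out: bool = False,
) -> list[str]:
    """Return candidate machine families for a coarse workload description.

    The checks run from the most specialized need to the least, and fall
    back to the cheaper general-purpose families when nothing else applies.
    """
    if needs_gpu:
        return VM_FAMILIES["accelerator-optimized"]
    if memory_intensive:
        return VM_FAMILIES["memory-optimized"]
    if compute_intensive:
        return VM_FAMILIES["compute-optimized"]
    if scale_out:
        return VM_FAMILIES["scale-out"]
    return VM_FAMILIES["general-purpose"]


print(suggest_families(memory_intensive=True))  # candidates for an in-memory database
print(suggest_families())                       # default: general purpose
```

The ordering of the checks is a design choice: a workload that is both GPU-bound and memory-hungry is steered to A2 first, since the article describes A2 as the option for workloads that require GPUs.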

Categories:InfrastructureComputeDevOps and SRE

What is CI/CD?

It’s tough to imagine a time when developers released their software without a preview at the end of their negotiated terms. Client dissatisfaction, miscommunications, and faulty code were only a few of the problems developers commonly faced.

2001 saw the publication of the Manifesto for Agile Software Development. In the practices that grew up around it, teams split work into modules and commit each bit of code to a repository, where the entire team reviews the code for bugs or errors in a process called continuous integration (CI). In this model, once that code is vetted, it is passed on for continuous delivery (CD), where it is tested in a mock environment, before being submitted for continuous deployment (also CD), to catch remaining flaws before release.

Since CI/CD is consistent and continual, some DevOps teams focus on CI, while others work on CD to move that code further down the pipe.

CI in short

Once a day, or as frequently as needed, developers commit (or “continuously integrate”) their built code to a shared repository, where the team checks it for errors, bugs, and the like. Tests include:

Unit testing that screens the individual units of the source code;
Validation testing that confirms the software matches its intended use; and
Format testing that checks for syntax and other formatting errors.

TL;DR

After successful continuous integration, the code moves on to continuous delivery or release, followed by continuous deployment (CD).

Both CI and CD are rinse-and-repeat situations, looping until the DevOps team is satisfied with the software.

Check out a handful of popular CI/CD environments; which one do you use?

Jenkins, an open-source solution
CircleCI
GitLab
GitHub Actions
TeamCity, a proprietary solution

CI/CD

Whether organizations add CD onto CI depends on the size of the organization. Complex organizations like the Big 3 (Google, Amazon, and Facebook) use all three stages.
Smaller organizations, or those with fewer resources, mostly leave out either the continuous deployment or the continuous delivery stage, preferring the more fundamental continuous integration module. CI helps DevOps teams practice fault isolation, catch bugs early, mitigate risks, and automate successful software roll-out.

Why Use CI/CD?

CI/CD helps software developers move higher-quality products to market faster.

Benefits include the following:

Manual tasks are easily automated.
Minor flaws are detected before they penetrate the main codebase.
CI/CD helps prevent technical debt, where overlooked flaws turn into expensive remediation down the line.
CI/CD processes can shorten cycle times.
The CI/CD process potentially makes software developers more productive, efficient, and motivated.

Considerations include the following:

CI/CD requires a talented DevOps engineer on your team, and these engineers are expensive.
CI/CD requires proper infrastructure, which is also expensive.
Few legacy systems support the CI/CD pipeline, so you may have to build your IT from scratch.
CI/CD also requires rapid response to alerts, like integration deadlines and issues raised by the unit team, for the system to work as it should.

Other options: If the hassle and expense seem overwhelming, there are alternatives. Consider the shortlist below:

AWS CodePipeline
Spinnaker
Google Cloud Build
Buddy
DeployBot

Bottom Line

According to a recent Gartner report, CI/CD has become a staple for top-performing companies during the last three years, with 51% of these organizations ready to check out these agile SaaS developments within the next five years. CI, too, remains more popular than CD, especially for smaller organizations.

For Google Cloud users like us, Cloud Build, with its particular CI/CD framework that works across various environments, is ideal.
It helps us improve our lead time (namely, decreasing the time it takes from building to releasing the code), boosts our deployment frequency, reduces Mean Time to Resolution (MTTR), and dramatically reduces the number of potential software defects.

Cloud Build can target VMs, serverless, Kubernetes, or Firebase, and helps us build, test, and deploy software quickly and at scale.

So, how have you used Cloud Build or CI/CD? Did any of this resonate? I’d love to hear your story!
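The delivery measures named above (lead time, deployment frequency, and MTTR) can be computed from plain timestamps. The sketch below is a minimal illustration under stated assumptions: the three function names and the sample data are hypothetical, lead time is taken as commit-to-deploy, and MTTR as the mean of detection-to-resolution times. It does not use Cloud Build or any CI/CD API.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median time from commit to production deploy."""
    seconds = [(deployed - committed).total_seconds() for committed, deployed in changes]
    return timedelta(seconds=median(seconds))


def deployment_frequency(deploys: list[datetime], window_days: int) -> float:
    """Average number of deployments per day over the window."""
    return len(deploys) / window_days


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time from incident detection to resolution."""
    seconds = [(resolved - detected).total_seconds() for detected, resolved in incidents]
    return timedelta(seconds=mean(seconds))


# Hypothetical three-day window of (commit, deploy) pairs.
changes = [
    (datetime(2022, 1, 1, 9), datetime(2022, 1, 1, 11)),
    (datetime(2022, 1, 2, 9), datetime(2022, 1, 2, 10)),
    (datetime(2022, 1, 3, 9), datetime(2022, 1, 3, 12)),
]
# Hypothetical (detected, resolved) pairs.
incidents = [
    (datetime(2022, 1, 2, 12, 0), datetime(2022, 1, 2, 12, 30)),
    (datetime(2022, 1, 3, 14, 0), datetime(2022, 1, 3, 15, 30)),
]

print("lead time:", median_lead_time(changes))
print("deploys/day:", deployment_frequency([d for _, d in changes], window_days=3))
print("MTTR:", mean_time_to_resolution(incidents))
```

Tracking these numbers before and after adopting a pipeline is one way to check whether the improvements described above actually materialize.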

Categories:Application DevelopmentDevOps and SRE

Cloud Usage, Positive Team Culture and Good Documentation Among Key Findings from State of DevOps Report

Earlier this year C2C and DORA collaborated to bring the C2C Community a potential answer to the question every manager asks: “how do I build a high-performing team?”

The State of DevOps Report by the DevOps Research and Assessment (DORA) team at Google Cloud represents seven years of research and data from more than 32,000 professionals worldwide, according to the research team. Their research “examines the capabilities and practices that drive software delivery, operational, and organizational performance.”

The team uses rigorous statistical techniques to “understand the practices that lead to excellence in technology delivery and to powerful business outcomes.” Their data-driven insights detail the most effective and efficient ways to develop and deliver technology.

Our research continues to show that excellence in software delivery and operational performance drives organizational performance in technology transformations. To allow teams to benchmark themselves against the industry, we use a cluster analysis to form meaningful performance categories (such as low, medium, high, or elite performers). After your teams have a sense of their current performance relative to the industry, you can use the findings from our predictive analysis to target practices and capabilities to improve key outcomes, and eventually your relative position. This year we emphasize the importance of meeting reliability targets, integrating security throughout the software supply chain, creating quality internal documentation, and leveraging the cloud to its fullest potential. We also explore whether a positive team culture can mitigate the effects of working remotely as a result of the COVID-19 pandemic. To make meaningful improvements, teams must adopt a philosophy of continuous improvement.
Use the benchmarks to measure your current state, identify constraints based on the capabilities investigated by the research, and experiment with improvements to relieve those constraints. Experimentation will involve a mix of victories and failures, but in both scenarios teams can take meaningful actions as a result of lessons learned.

Key Findings

The report is lengthy, but it details six key areas that help ensure strong, efficient, and effective software delivery teams. Please note, these findings are republished courtesy of the DORA team’s research report.

The highest performers are growing and continue to raise the bar. Elite performers now make up 26% of teams in our study, and have decreased their lead times for changes to production. The industry continues to accelerate, and teams see meaningful benefits from doing so.

SRE and DevOps are complementary philosophies. Teams that leverage modern operational practices outlined by our Site Reliability Engineering (SRE) friends report higher operational performance. Teams that prioritize both delivery and operational excellence report the highest organizational performance.

More teams are leveraging the cloud and see significant benefits from doing so. Teams continue to move workloads to the cloud and those that leverage all five capabilities of cloud see increases in software delivery and operational (SDO) performance, and in organizational performance. Multi-cloud adoption is also on the rise so that teams can leverage the unique capabilities of each provider.

A secure software supply chain is both essential and drives performance. Given the significant increase in malicious attacks in recent years, organizations must shift from reactive practices to proactive and diagnostic measures. Teams that integrate security practices throughout their software supply chain deliver software quickly, reliably, and safely.

Good documentation is foundational for successfully implementing DevOps capabilities.
For the first time, we measured the quality of internal documentation and practices that contribute to this quality. Teams with high quality documentation are better able to implement technical practices and perform better as a whole.

A positive team culture mitigates burnout during challenging circumstances. Team culture makes a large difference to a team’s ability to deliver software and meet or exceed their organizational goals. Inclusive teams with a generative culture experienced less burnout during the COVID-19 pandemic.

Read the article on the Google Cloud Blog, but if you want the full report, here is the PDF.

New to DORA and this insight? We got you covered! Check out the links to previous DORA articles, videos, and more from C2C.

Extra Credit

First event with @nathenharvey explaining DORA to the C2C Community and previous findings:
Second event with @nathenharvey discussing results and how to use them:
The C2C final call to the community to participate, with more details about what DORA is and why it’s valuable:
Hey, you should check out DORA; you can learn more about it in this article:

Categories:DevOps and SRE

CI/CD DevOps Continuous Testing Strategy Integration and Methods

This article aims to explain the importance of DevOps continuous testing in our ever-accelerating world. We will delve into various methods for seamless integration with your team, the many benefits of constant testing in DevOps, and its handful of drawbacks.

DevOps is a relatively new concept within organizations that have in-house operations and development teams. It enables both teams to come together and collaborate on faster and more efficient software development solutions. The result is a speedier software development lifecycle (SDLC) with fewer bugs and more frequent patching and updates.

What is Continuous Testing?

Continuous testing in DevOps is the constant probing of an application for errors in code or communication. It is present throughout both the continuous integration and continuous delivery phases of the CI/CD process.

Traditionally, developers compiled their code and sent it off to a testing environment. Then, a quality assurance (QA) team ran tests against the code to find bugs and vulnerabilities. If the QA team found issues, it sent them back to the developer to fix, and the process repeated.

QA worked as fast as possible to find bugs, but deadlines were often missed, and patches might not be deployed for weeks. DevOps brings a new way to perform testing by automating the process. As developers code, their pieces are compiled and tested continuously. The automation scripts find bugs and send notifications to developers when issues are found. This method speeds up the testing process and gives developers feedback before they deploy to a testing environment.

What are the Benefits of Continuous Testing in DevOps?

The benefits of continuous testing in DevOps share a common motif: speed. Over-capacity programmers forged this probing process to keep pace with their shrinking release schedules.
However, as constant testing carved away ever larger chunks of their valuable time, many independently created automated and manual testing processes to win that time back. Each process uncovered further benefits of continuous testing in DevOps and encouraged further exploration.

Benefits:

Quickly and reliably deliver code changes
Automated steps create additional time to focus on code quality and security
Smaller code differentials allow for easier identification of defects
Eliminates testing bottlenecks
More agility with a constant understanding of what is working well and what is not
This leads to more educated release decisions
Delivers a consistent experience to users

Benefits flow to end-users as well. For example, a business that relies heavily on its software might make changes based on user feedback (e.g., a gaming developer), and DevOps continuous testing improves the velocity of development. Testing is performed automatically, feedback is sent immediately to developers, and developers can fix bugs as reported rather than wait for a human QA team.

What is Not a Benefit of Continuous Testing?

The benefits of continuous testing in DevOps are far-reaching, but even these streamlined processes come with their own set of drawbacks. Often, leaner companies do not have the required resources to set up proper continuous testing processes, or the time investment outweighs the benefits for their relatively static application.

Disadvantages:

Additional training may be required for developers not accustomed to CI/CD timelines.
Testing requires some technical architecture skills and honed time management.
Establishing tests can be time-consuming, with variations and workarounds often needed for the many layers of an application.

Most businesses find that the most significant disadvantage is the additional cost of infrastructure and in-house staff. The DevOps team requires staff trained to deal with development, operations, and testing automation.
This team could be current staff, but then staff needs training on new software and infrastructure.

Other additional costs include the software and hardware necessary to execute automation. Free, open-source software is available, but it usually does not have the features for enterprise DevOps. In addition, for new DevOps teams, staff often need time to test and review the software before it’s implemented as part of the automation process, which takes time and resources from other development projects.

Continuous Testing Integration & Methods

Architects understand that varying climates require unique designs suited to regional temperature fluctuations. For example, homes in colder regions often have roofs at steeper angles to prevent snow accumulation. In contrast, warmer weather homes utilize heavier building materials to absorb heat during the day and dissipate it at night. Technical architects take the same approach, deploying tests to amplify the application’s strengths while mitigating and preparing for its weaknesses.

Suppose a technical architect is designing a DevOps testing strategy for an application with many moving parts that passes through many developers’ hands. In that case, a unit testing method might be selected. This enables rapid feedback on new code by breaking down and testing the code in digestible units. Unit testing, however, is only one of the many tools a technical architect possesses in their “Mary Poppins-esque” tool chest. Additional examples include production testing, exploratory testing, automated end-to-end testing, regression testing, etc. Each has its own benefits and drawbacks, and should be applied strategically, case by case.

User Acceptance Testing

User acceptance testing (UAT) is a specific type of continuous testing to determine if the application satisfies the user’s expectations and needs.
Software design starts with collecting user requirements, and they must be tested to ensure that contracts are met.

Testing can be done by other developers or users themselves, depending on the target. For example, in-house developers test software meant for external users before presenting it to the customer, but the end-user might test software when it’s meant for in-house tools. For basic requirements, automated scripts test for functionality defined in contracts and planning documents.

Performance Testing

Another widely used DevOps testing strategy is performance testing. This method of testing ensures that software does not crash from too many concurrent users. It tests for loads dependent on the number of users projected to use the software. For example, if 1000 users are projected to use the application concurrently, then performance testing ensures that software executes efficiently with 1000 user connections and no performance degradation is detected.

Smoke Testing

Smoke testing is a continuous testing process that uses subsets of testing procedures to test the most critical parts of a new code deployment’s functionality. This testing step is performed right before the code is passed to QA. It ensures that basic functionality works before human testers review software features. Smoke testing saves time on QA because it finds bugs before staff spends too much time on buggy code.

User Interface Testing

User interface elements are created by designers and sent to developers to integrate into the code. These elements must be tested for bugs and user errors. For example, a layout might consist of desktop and mobile designs. UI testing ensures that features and assets render properly on both screen sizes and that both layouts render dynamically based on devices and their screen resolutions.

Regression Testing

As developers add new features to the software, they must ensure that changes do not affect current functionality.
Regression testing ensures that recent code changes don’t break current features in software. It can be automated, with some human QA also reviewing old features to ensure that users do not lose functionality when new functionality is added.

Using Google Cloud Monitoring With Continuous Testing Frameworks

Continuous testing is a highly effective tool for improving code deployments and creating better coding practices. Still, developers and cloud engineers alike can benefit significantly from integrating a continuous testing process with GCP.

With Google Cloud Monitoring and the cloud operations suite, teams can use metrics and traces to gather meaningful insights from every build and establish performance indicators across cloud applications. In addition, DevOps teams can evaluate the success of their deployments and automation scripts to improve testing continuously.

For developers, continuous integration and testing mean faster velocity during the software development lifecycle. Developers can deploy updates and patches faster because continuous testing gives feedback on the latest code changes. As developers write code, they learn of bugs and can fix them before the code is compiled and sent to QA. With automation and GCP, developers can reduce deployment times from several weeks to potentially minutes. In addition, automation speeds up deployment, and GCP provides the reliability and scalability needed for enterprise applications.

Extra Credit

What is CI/CD? Continuous integration and continuous delivery explained
Fitting Continuous Testing Into Your DevOps Pipeline
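The unit and regression testing described in this article can be sketched with Python’s built-in unittest module. This is a minimal, hypothetical example: `apply_discount`, the test names, and the bug the regression test guards against are all invented for illustration, and `run_suite` simply stands in for the step a CI job would run on every commit.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a whole-number percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


class DiscountTests(unittest.TestCase):
    def test_typical_discount(self):
        # Unit test: one small piece of behavior, checked in isolation.
        self.assertEqual(apply_discount(1000, 25), 750)

    def test_full_discount_stays_zero(self):
        # Regression test: pins behavior for a (hypothetical) earlier bug
        # so that later changes cannot silently reintroduce it.
        self.assertEqual(apply_discount(1000, 100), 0)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 150)


def run_suite() -> bool:
    """Run the tests the way an automated pipeline step might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiscountTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A pipeline would typically fail the build when `run_suite()` returns `False`, which is what gives developers feedback before the code reaches QA.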

Categories:DevOps and SRE

The DevOps Life Cycle in 7 Steps Using Netflix

Netflix, the world's most popular movie streaming platform, deploys 100 pieces of code per day with only 70 engineers. So how do they do it? With the DevOps lifecycle method, which Netflix claims helps them squeeze “canaries into hours instead of days,” partly because it helps their engineers “quickly research issues and make changes rather than bouncing the responsibilities across teams.”

Use DevOps principles with Google Cloud Platform (GCP) to develop your applications and services.

Software methodologies: Waterfall, Agile, and DevOps

Certain companies use the Agile and Waterfall methodologies to deploy software. Each, in its way, produces faster and more disciplined project delivery than before. But Waterfall, with its siloed groups, usually creates months-long delays that sometimes kill projects, while the more efficient Agile is pocked by miscommunication.

New kid on the block, DevOps, augments Agile by:

Merging the development and operations teams. (Some companies also blend Development, Operations, and Security into DevSecOps.)
Automating each part of their DevOps lifecycle
Practicing continuous improvement using feedback from the blended team

DevOps through the Netflix lens

By 2018, a frustrated Netflix had experimented with various software methodologies before it adopted the DevOps method of Belgian consultant Patrick Debois, which gave shared ownership to developers and operators over the entire software lifecycle. Essentially, Netflix lets the developers test what they create while allowing the operators to fix what they operate. But, of course, DevOps engineers need to know the tools of both specialties.

Here’s how the DevOps lifecycle works

Netflix runs its DevOps software life cycle through seven phases. Each is called “continuous” because the team repeats each step before moving on to the next.

1 - Continuous planning and development

Here, continuous project planning and development or coding take place.
But, first, the team sets out the idea and discusses which coding to use.  2 - Continuous code integration/ building  The team builds the code for unit testing, integration testing, code review, and packaging. It also commits to evaluate this code once a week and to integrate revisions and add-ons as needed. Most DevOps teams use Jenkins to execute their code. 3 - Continuous testing DevOps engineers test their developments in artificial environments created through Docker containers. They use automated testing tools like TestNG, JUnit, and Selenium, to root out bugs and malware deficiencies.  4 - Continuous feedbackThe team assesses how the code “behaves” in its simulated environment, then they incorporate improvements as needed before repeating the piloted tests. 5 - Continuous deployment The final application code is uploaded and deployed on the various production servers through tools like Vagrant, Docker, Ansible, and Kubernetes. Other devices, notably Chef, Puppet, Ansible, and SaltStack, automate the code. 6 - Continuous monitoringDevOps engineers assess how the code runs, checking for security and functionality issues. As a result, engineers use Nagios as one of their favorite tools. 7 - Continuous operations This is the shortest, most thrilling stage of all. The DevOps team automates the product release and subsequent updates. What Netflix does for the DevOps life cycle to work Engineers on the DevOps team have to be uniquely motivated and well-rounded. Each needs to have the skills of their specialized skill-set and the skill-set of SWE, SDET, and SREs. With their focus on automation, questions that DevOps engineers need to ask include “how can I automate what is needed to operate this system?” Netflix augments its DevOps lifecycle by investing in high-quality personnel, tools, and training. 
Then, when a product or service fails, Netflix spins it through five whys iteratively, probing to uncover the “how” and “why” the issue occurred and how to improve before returning the code to the DevOps lifecycle and optimizing as needed. Bottom Line The DevOps lifecycle helps Netflix deliver regular deployments through its continuous improvement and automation practices and blending development with operations.Netflix admits that the trade-off means sacrificing specialization for generalization. Still, grouping different specialists in one team reduce silos, eliminates bottlenecks, improves product accuracy and speed, and drastically cuts production costs. In 2018, Netflix testified that DevOps helped them grow to 125M global members who enjoy 140M+ hours of viewing per day.  Let’s Connect!Leah Zitter, Ph.D., has a Masters in Philosophy, Epistemology, and Logic and a Ph.D. in Research Psychology.

Categories: Industry Solutions | DevOps and SRE | Media, Entertainment, and Gaming

What’s the Difference Between DevOps and SRE?

Back in 2009, IT consultant Patrick Debois merged developers who wrote code with operators who ran that code so that companies could complete their work better, faster, and cheaper. Around that time, red-hat experts called site reliability engineers (SREs) stepped in to help Google meet its internal needs by using software to manage systems, solve problems, and automate operations tasks. From Google, SREs made their way to other major companies.

DevOps

DevOps is composed of five key areas. People on these teams:

- Reduce organizational silos by improving the communication between the two teams.
- Encourage their members to work towards progress rather than perfection.
- Implement gradual change. Incremental changes are not only easier to review, but also easier to revise when necessary.
- Define standardized procedures and hierarchies for testing, developing, and automating tasks.
- Measure everything so they see what succeeds and what needs revision.

DevOps is a mindset designed to increase business value through agile, high-quality service delivery.

Site Reliability Engineers (SREs)

SREs step in and contribute to each of the DevOps areas in these ways:

- They use one set of standards and tools across the Dev and Ops fields.
- They quantify failure and risk through service-level indicators (SLIs) and service-level objectives (SLOs). (SLIs are metrics measured over time, such as latency, availability, and error rate, while SLOs set targets for those SLIs over a window, say a maximum error rate over the last month.)
- They encourage developers and operators to move quickly by reducing the cost of failure.
- They spend as much as 50% of their time standardizing and automating “this year's” operation tasks.
- They measure metrics like availability, uptime, outages, automation, and so forth.

Their work can be extremely strenuous. As Google’s own SRE Andrew Widdowson describes it, “Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100 mph.”

Comparison Between DevOps and SRE

DevOps and SRE share the following responsibilities:

- Each eliminates the silos that divide their distinct groups and works to unify them.
- Both focus on standardizing the work and on automation. Without automation, the work becomes too expensive, too slow, and far too error-prone.
- Each spends equal parts of their time in Dev and Ops.

And yet they differ in the following ways:

- DevOps reviews the entire life cycle of the project from the production stage to integration. SREs, on the other hand, limit themselves to the nuances of production.
- DevOps focuses on team synthesis and eliminating barriers, putting cultural change above all else. SREs, on the other hand, obsess over standards and metrics.
- DevOps focuses on product development and delivery. SREs zero in on the system. Is the system running? Is it readily available?
- The DevOps team spans anyone involved in the life cycle of that product: product owners, devs, testers, operators, and so forth. SREs, in contrast, are engineers who exclusively write code.

Bottom Line

Google compares DevOps to an interface in a programming language and SRE to a class that implements that interface. Consider them complementary rather than different. DevOps addresses the what (“What needs to be done”) while SRE addresses the how (“How should it be done”). DevOps covers the overall life cycle of the product from A to Z, while SREs zero in on the system. DevOps is a mindset, while SREs hop in to build proactive testing, observability, service reliability, and speed into that culture. You need both methodologies for DevOps-centered organizations.

Let’s Connect!

Leah Zitter, Ph.D., has a Masters in philosophy, epistemology, and logic and a Ph.D. in research psychology.
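Returning to the SLI and SLO distinction from the article above, a small calculation may help make it concrete. The sketch below is only an illustration: the traffic figures, the 99.9% target, and the function names are assumptions, and the "error budget" (the failures an SLO still permits) is a common SRE convention rather than something this article defines.

```python
def availability_sli(total_requests, failed_requests):
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(total_requests, failed_requests, slo_target):
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


# Hypothetical month of traffic: 1,000,000 requests, 400 failures, 99.9% SLO.
SLO_TARGET = 0.999
sli = availability_sli(1_000_000, 400)
budget = error_budget_remaining(1_000_000, 400, SLO_TARGET)
print(f"SLI={sli:.4%}  SLO met={sli >= SLO_TARGET}  budget left={budget:.0%}")
```

The SLI is the measurement; the SLO is the target it is compared against. A team that still has budget left can afford to ship faster, which is one way of reading "reducing the cost of failure."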

Categories: Getting Started with Google Cloud | DevOps and SRE

Assess and Improve Your Team's Software Delivery Performance (full video)

Many thanks again to Nathen Harvey (@nathenharvey), cloud developer advocate at Google, for providing more information to prepare teams for DevOps and SRE best practices in the cloud.

Ahead of reviewing this recording, be sure to complete the DORA DevOps Quick Check and follow along with the live demo.

The full recording from this session includes:

- (1:30) Speaker introduction
- (2:15) What is DORA?
- (4:30) The four key metrics for software delivery and operations performance
- (9:30) Demo of the DORA DevOps Quick Check
- (17:50) Understanding your software delivery performance results
- (21:10) Using results to improve your team’s performance
  - Performance comparisons across all industries or a specific industry
  - Performance profile breakdown and performance levels
  - Designated improvement areas
- (24:25) Prioritizing capabilities with ranked recommendations and guides on implementing changes
  - Continuous integration
  - Continuous delivery
  - Architecture
- (27:35) Exploring DORA research and inferential predictive analysis and achieving optimal organizational performance
- (34:00) DORA Research Program Diagram
  - Lean product development
  - Change approvals
  - Transformational lea
  - Lean management
  - Technical practices
  - Culture and work environment
  - Organizational performance
  - SDO performance
- (41:45) Open Q&A

Finally, participate in the survey for the State of DevOps Report. The full report is expected to be released this fall.

Other Resources:

- DORA Research Program Diagram
- Take the DORA DevOps Quick Check
- Google Cloud DevOps

Categories: DevOps and SRE | Session Recording

What Is Observability in Software Engineering?

In large-enterprise environments, engineers use observability to identify software behavior, ensure execution as designed, and detect any anomalies. Observability is often referred to as how logs provide output about infrastructure so that administrators and software engineers can monitor a system. It also gives insights to software engineers about the way users interact with their applications and provides information for future innovation and scalability.

What Is Application Observability?

The concept of observability first originated in control systems engineering, and only recently has it been applied to cloud computing and DevOps. In its simplest form, application observability is a measurement of the internal state of a complex system based on external output.

Full-stack observability refers to the strategies and tools used to aggregate performance data from cloud computing applications. A “stack” in this context is the technology used to run a system. For example, the LAMP stack stands for “Linux, Apache, MySQL, and PHP” technologies. These technologies together could run your web applications. Facilitating observability of cloud stacks, you can use logs and analytics tools to find the information necessary for determining if the application is running as intended, its performance is sufficient, and any anomalies are detected and reported.

Key Differences Between Observability and Monitoring

While observability and monitoring are different phenomena in performance management and analysis, they do share some relation. Namely, monitoring is born out of observability because a machine or an instrument cannot be monitored if it isn’t observable. And herein also lies some of their most significant differences.

When comparing and contrasting monitoring vs. observability, one of the main differences lies in the “how” and the “why” of app performance. That’s because monitoring gives engineers a much more general view of their system’s health and alerts developers to internal anomalies only after they’ve occurred. Full stack observability allows for continuous telemetry, giving engineers contextual insight that will enable them to answer the “why” behind an inconsistency.

Additionally, another notable difference between observability and monitoring is that monitoring is built around existing questions used to interrogate performance metrics. In contrast, observability allows developers to impart new questions onto their systems to understand response status and debug issues.

Why Is Full Stack Observability Important in DevOps?

The use of observability in telemetry is not a new concept. It’s key to understanding the internal states of complex instruments. So it’s no surprise that observability has found its way into cloud computing and DevOps, which often rely on continuous cycles to keep up with the on-demand world we inhabit today.

This immediacy of modern software development has given rise to a need for closeness in performance management. Full-stack observability allows developers to test requests millisecond by millisecond, allowing for more precise telemetry and a quicker resolution of existing bugs.

DevOps teams use automation to scale code deployments, and observability gives insight into deployment success. Observability can determine scaling requirements, deployment changes to make, and any errors causing testing issues. It helps developers and operations better understand the way DevOps tools execute and provides metrics on the performance of deployment applications.

This dynamic operations model also creates greater demand for agile performance management and diversifies the types of questions developers ask throughout the debugging process. When considering the debate between monitoring vs. observability, asking better questions makes observability conducive to a performance analysis model.
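One way to picture the contrast drawn above, between monitoring's predefined questions and observability's new ones, is the following minimal sketch. The event fields, the threshold, and both function names are hypothetical and chosen only for illustration; real tooling would read from a telemetry backend rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical request events; field names are illustrative assumptions.
EVENTS = [
    {"route": "/checkout", "region": "eu", "latency_ms": 120, "error": False},
    {"route": "/checkout", "region": "us", "latency_ms": 950, "error": True},
    {"route": "/search", "region": "us", "latency_ms": 80, "error": False},
    {"route": "/checkout", "region": "us", "latency_ms": 870, "error": True},
]


def error_rate_alert(events, threshold=0.25):
    """Monitoring: one predefined question with a fixed threshold."""
    rate = sum(e["error"] for e in events) / len(events)
    return rate > threshold


def error_rate_by(events, field):
    """Observability: slice the same events by any field to ask a new question."""
    totals, errors = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e[field]] += 1
        errors[e[field]] += e["error"]
    return {key: errors[key] / totals[key] for key in totals}


print(error_rate_alert(EVENTS))         # the alert only says something is wrong
print(error_rate_by(EVENTS, "region"))  # slicing hints at where
print(error_rate_by(EVENTS, "route"))   # ...and in which code path
```

The alert answers a question decided in advance. The slicing function works only because each event kept its context (route, region), which is the property that lets engineers ask questions they did not anticipate.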
Observability Strategies for Continuous Performance Visualization

Building an observable system isn’t complex, but it is beneficial. To create better full stack observability in DevOps, engineers can turn to these four pillars of performance metrics to establish KPIs and create an observable system that will help them uncover the “why” behind any anomalies in their applications operations.

Using Metrics

The metrics displayed in observability tools can help developers with innovation in future product releases. Observability tools collect data from different endpoints and applications so that developers can spot trends over a specified duration. For example, developers could find trends in application performance and usage spikes for a month to determine the organization’s busiest time of the day.

Using Traces

Traces are necessary for tracking known issues. For example, developers can use traces to determine the cause of a specific error. These errors could be traced to particular code issues, infrastructure failure, or insufficient resources and scaling.

Using Dependencies

Modern software uses numerous dependencies and libraries that perform application functions. External developers could manage these dependencies, so they must be monitored for changes and errors. Observability tools map dependencies to application architecture to track for updates, vulnerabilities, and code audits.

Using Logs

Logs play a significant role in full stack observability and precise debugging because they compile real-time insights that allow developers to quickly play back potential bugs and troubleshoot issues. Logs are essential in monitoring and observability in the cloud and on-premise environments. They contain all data collected from various locations within the environment for analytics, visualization tools, and other reporting applications.
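As a rough illustration of the metrics, traces, and logs ideas above, the sketch below derives a "busiest hour" metric and a list of failing trace IDs from a handful of structured log lines. The log format and field names (`timestamp`, `trace_id`, `status`) are assumptions made for this example and do not reflect any particular tool's schema.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical structured log lines; field names are illustrative assumptions.
LOG_LINES = [
    '{"timestamp": "2021-05-03T09:15:00", "trace_id": "a1", "status": 200}',
    '{"timestamp": "2021-05-03T09:47:00", "trace_id": "a2", "status": 500}',
    '{"timestamp": "2021-05-03T14:05:00", "trace_id": "a3", "status": 200}',
    '{"timestamp": "2021-05-04T09:30:00", "trace_id": "a4", "status": 200}',
]


def busiest_hour(lines):
    """Metric: the hour of day (0-23) that saw the most requests."""
    hours = Counter(
        datetime.fromisoformat(json.loads(line)["timestamp"]).hour for line in lines
    )
    return hours.most_common(1)[0][0]


def failing_traces(lines):
    """Trace lookup: IDs of requests that ended in a server error (5xx)."""
    records = (json.loads(line) for line in lines)
    return [r["trace_id"] for r in records if r["status"] >= 500]


print(busiest_hour(LOG_LINES))    # hour of day with the most traffic
print(failing_traces(LOG_LINES))  # trace IDs worth following up
```

Keeping logs structured (one JSON object per line here) is what allows the same raw data to serve as a log, feed a metric, and point to a trace.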
Building an Observable System with Stackdriver Logging

Google Cloud’s answer to the growing need for observable systems in DevOps, Stackdriver Logging, is a performance management tool that leverages logging and tracing to visualize performance issues more seamlessly and with greater context.

What makes Stackdriver Logging beneficial is that it takes the logged data collected across your cloud environment and turns it into visualizations like graphs and charts that give you easily recognizable insights. You also have trace abilities to identify anomalies and speed up bug fixing, should your application have errors.

Because Google Cloud Platform was made for analytics, Stackdriver Logging and Stackdriver Trace are complete observability tools that help developers and operations monitor cloud applications. They use advanced BigQuery techniques to pull data from logs that could have billions of line items for large platforms. These observability tools are designed to help you monitor cloud applications for performance and reliability if you use Google Cloud Platform.

Extra Credit

- https://thenewstack.io/monitoring-vs-observability-whats-the-difference/
- https://www.ibm.com/cloud/learn/observability
- https://www.sumologic.com/glossary/observability/
- https://newrelic.com/blog/best-practices/what-is-modern-observability

Categories: DevOps and SRE

Drive Your Team's Software Delivery Performance (full video)

This C2C Deep Dive was led by Nathen Harvey (@nathenharvey), cloud developer advocate at Google, who helps the community understand and apply DevOps and SRE practices in the cloud.

The Google Cloud DORA team has undertaken a multi-year research program to improve your team’s software delivery and operations performance! In this session, Nathen introduced the program and its research findings, and invited Google Cloud customer Aeris to demonstrate the tool in real time.

Participate in the survey for the State of DevOps Report by July 2.

The full recording from this session includes:

- (1:40) Speaker introduction
- (3:20) How technology drives value and innovation for customer experiences
- (4:20) Using DORA for data-driven insights and competitive advantages
- (7:15) Measuring software delivery and operations performance
  - Deployment frequency
  - Lead time for changes
  - Change fail rate
  - Time to restore service
- (14:20) Live demonstration of the DevOps Quick Check with Karthi Sadasivan of Aeris
- (23:00) Assessing software delivery performance results
  - Understanding benchmarks from DORA’s research program
  - Scale of low, medium, high, or elite performance
  - Predictive analysis by DORA to improve outcomes
- (29:30) Using results to improve performance
  - Capabilities across process, technical, measurement, and culture
  - Quick Check’s prioritized list of recommendations
- (37:40) Transformational leadership for driving performance forward
  - Psychological safety
  - Learning environment
  - Commitment to improvements
- (41:45) Open Q&A

Other Resources:

- Take the DORA DevOps Quick Check
- Results from the Aeris software delivery performance assessment
- Google Cloud DevOps
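The four measures named in the session outline (deployment frequency, lead time for changes, change fail rate, and time to restore service) can be computed from a simple deployment log. The sketch below is a minimal illustration under stated assumptions: the record fields and sample data are hypothetical, medians are used as one reasonable summary, and this is not how the DORA Quick Check itself calculates results.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records; the field names are assumptions for illustration.
DEPLOYMENTS = [
    {"committed": datetime(2021, 6, 1, 9), "deployed": datetime(2021, 6, 1, 13),
     "failed": False, "restored": None},
    {"committed": datetime(2021, 6, 2, 10), "deployed": datetime(2021, 6, 3, 10),
     "failed": True, "restored": datetime(2021, 6, 3, 12)},
    {"committed": datetime(2021, 6, 4, 8), "deployed": datetime(2021, 6, 4, 10),
     "failed": False, "restored": None},
    {"committed": datetime(2021, 6, 7, 9), "deployed": datetime(2021, 6, 7, 15),
     "failed": False, "restored": None},
]


def four_key_metrics(deployments, period_days):
    """Summarize the four software delivery measures over a reporting period."""
    failures = [d for d in deployments if d["failed"]]
    lead_times = [d["deployed"] - d["committed"] for d in deployments]
    restore_times = [d["restored"] - d["deployed"] for d in failures]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_fail_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


metrics = four_key_metrics(DEPLOYMENTS, period_days=7)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

The first two measures describe throughput and the last two describe stability, so a team can check whether it is trading one for the other.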

Categories: Application Development | DevOps and SRE | Cloud Operations | Session Recording

The 7 Phases of Software Development

More than two decades ago, the software development team and the operations team were separated by a wall, where developers “developed” the product from scratch and then tossed it over the wall for the operations team to test and implement. Naturally, this process lasted weeks, if not months, as each team sat with folded arms waiting for the other team to toss the software back to them for rework and fixing. The process was not only costly, but it also risked the company's survival, as competitors raced ahead with their own innovations.

In 2008, a frustrated Belgian consultant, Patrick Debois, bulldozed that wall with his DevOps concept, where the development team and the operations team sit down at the same tables and collaborate to roll out innovations. Each makes certain accommodations such as splitting work into minute segments, learning the other team’s “language,” and veering towards open information.

The results are faster, more agile, more correct, and cheaper performance, among other benefits.

The DevOps Life Cycle

For DevOps to work, its life cycle is split into seven phases. The development team executes the first four stages:

- Planning: The team outlines the process, milestones, and goals.
- Coding/Version Control: The team uses tools like Git for coding before storing projects in a repository.
- Building: The code is made executable through tools like Maven and Gradle.
- Testing: Selenium is one of the most popular options used when testing for bugs or errors.

Once the code has passed several manual and automated tests, it’s tossed over to the operations team for the subsequent phases:

- Deployment: This is where code is placed on the web server. Upgrades or new features could be added.
- Operation: Operations automates the code through tools like Docker, Ansible, and Kubernetes.
- Monitoring: Nagios is one of the top tools used to assess that the product works.

Feedback is tossed back to the developers, and so the system loops round in one so-called continuous delivery (CD) cycle, until developers and operations are satisfied.

DevOps in Practice

Major companies that use DevOps include Amazon, Netflix, Target, NASA, and Hertz. These and other organizations make the system work by extracting a few individuals from both groups and sitting them in adjacent cubicles or around one table.

For Jeff Bezos, groups should be small enough that two pizzas are enough to feed the entire party.

The DevOps Personnel

For this rather complex structure to work, you need at least four main personnel to run it:

- The release manager, who coordinates the product from development through production
- The software developer/tester
- The security engineer
- The automation architect, also called the “integration specialist”

Needless to say, the fast-paced DevOps environment requires a special breed of people who can keep up with the sprint, work well together, and know their technology.

Bottom Line

Companies that want to survive in today’s hyper-innovative world need to produce software not every three months, as in the past, but every hour. To do this, they have to be nimble, accurate, and frictionless. And that’s where the revolutionary concept of DevOps comes in, where the development and operations teams look over each other's shoulders as each performs its own task. The results? A 45% increase in workplace productivity, according to a CA Technologies study.

Let’s Connect!

Leah Zitter, Ph.D., has a masters in philosophy, epistemology, and logic and a Ph.D. in research psychology.

There is so much more to discuss, so connect with us and share your Google Cloud story. You might get featured in the next installment! Get in touch with Content Manager Sabina Bhasin at sabina.bhasin@c2cgobal.com if you’re interested.

Categories: DevOps and SRE

Take the 2021 State of DevOps survey: Shape the future of DevOps

We are so excited to partner with Google Cloud on a research project testing and improving the performance of software delivery teams, with a live C2C Talks event on June 9 and a discussion of the results on June 23.

Read on to learn more about the survey and how C2C community members can get involved.

What’s the survey about?

DORA's State of DevOps research program represents six years of research and data from over 31,000 professionals worldwide. It is the longest-running academically rigorous research investigation of its kind, providing an independent view into the practices and capabilities that drive high performance in technology delivery and ultimately organizational outcomes. Our research uses behavioral science to identify the most effective and efficient ways to develop and deliver software.

Use our quick check tool to discover how you compare to industry peers, identify specific capabilities you can use to improve performance, and make progress toward becoming an elite performer.

Google Cloud and the DORA research team launched the 2021 State of DevOps survey this week. The survey takes approximately 25 minutes to complete, and we are looking for C2C Community members to participate.

Why do our answers matter?

Your answers will allow the Google Cloud DORA team to better understand the practices that teams are employing to improve software delivery performance and, in turn, generate powerful business outcomes.

Tell me more about the DevOps report, please.

The State of DevOps report provides an independent view into the practices and capabilities that organizations can employ to drive better performance irrespective of their size, industry, and region.

Like the past six research reports, the goal this year is to perform detailed analysis to help various teams benchmark their performance against the industry as elite, high, medium, or low performers. We also look to show specific strategies that teams can employ to improve their performance. The table below highlights elite, high, medium, and low performers at a glance from the last report.

Achieving elite performance is a team endeavor, and diverse, inclusive teams drive the best performance. Likewise, the research program benefits from the participation of a diverse group of people. Please help us encourage more voices by sharing this survey with your network, especially with your colleagues from underrepresented parts of our industry.

The survey is for everyone, regardless of where you are on your DevOps journey, the size of your organization, or your organization's industry. There are no right or wrong answers; in fact, we often hear feedback that questions in the survey prompt ideas for improvement. Also, you can put many of the ideas into practice immediately.

Some of the key topics we look to deep dive into this year include:

- Metrics and Measurement: Practices employed by high performing teams
- SRE and DevOps: How they fit together and how they impact performance
- How to best integrate security & compliance as a part of your app development
- The impact of cloud, monitoring & observability, open-source, and documentation on performance
- Distributed teams: Practices to improve work/life balance and reduce burnout
- The state of multi-cloud computing

How can I get involved?

We’re looking for one C2C Community member to take the survey live with the DORA team on June 9. You’ll need to meet the criteria noted below to be considered, and C2C will make its selection on May 27. The advance notice is to allow you and your teams to prepare and to have any preliminary questions answered.

Feel free to reach out to Sabina@c2cglobal.com with your interest, or comment below.

Criteria for consideration:

- You are open to having your team’s performance shared in a live C2C Talks event.
- You can attend the follow-on discussion on June 23 to share and discuss your results.
- You’re open to participating in a case study with Google Cloud after June 23.

Final Thoughts

If you are interested in the data but not eager to participate in the live event or be considered, you can take the survey on your own. It will remain open until midnight PST on June 11, 2021. Stay tuned to our Events page for registration to both C2C Talks events.

Categories: Google Cloud News | Google Cloud Strategy | DevOps and SRE

How to Auto-clean Your GCP Resources

A sub-dollar concierge for your Google Cloud environment

A cook has to clean their kitchen at some point, right? It’s not just to humor the health inspection but also to keep things running as smoothly and hygienically as possible. In the world of software engineering, this is no different: you’ll want to make sure that when you start your day, your pots and pans are clean.

In this tutorial, we’ll craft a low-cost, cloud-native tool to keep your Google Cloud projects shiny. And what’s more, after completing this, you’ll be able to automate many more tasks using the same toolset!

You can find a ready-to-go version of this setup called ZUNA (aka Zap Unused and Non-permanent Assets) on GitHub.

Motivation

When using cloud services, you can create new assets in a breeze. Even when your project is fully terraformed, you may still encounter some dirty footprints in your environment. Maybe it was that one time you quickly had to verify something by creating a Cloud SQL instance, or those cleanup scripts that occasionally fail when the integration tests go crazy.

Indeed, a system can fail at any step: What if the instance running the tests breaks down? What if an unexpected exception occurs? What if the network is down? Any such failure can lead to resources not being cleaned up. In the end, all these dangling resources will cost you: either in direct resource cost, or in the form of toil¹.

I do recognize that resources not being cleaned up might be the last thing on your mind when a production setup fails. Nevertheless, it’s still an essential aspect of maintaining a healthy environment, whether for development or production purposes. But don’t let this keep you from building auto-healing production setups!

Deployment Overview

We will create a system responsible for the automatic cleanup of specific resources in a GCP project. We can translate this into the following task: check periodically for labeled resources, and remove them.

Ideally, the system is quick to set up, flexible, and low-cost. By the end of this post, our setup will look as follows:

Figure: An overview of the ZUNA setup. In this post, we’ll focus on Pub/Sub Subscriptions.

We will use the following GCP services to achieve this:

- Cloud Scheduler: takes care of automation, and it will provide us with that literal press-of-the-button for manually triggered cleanups.
- Cloud Functions: a serverless Python 3 script to find and delete the GCP resources we’re interested in. You can easily extend such a script to include new resource types.
- Labels: many resources in GCP can be labeled; we’ll use this to mark temporary resources that should be removed periodically.
- IAM: we’ll make sure our system adheres to the least privilege principle by using a Service Account with only the required permissions.

Using these services will quickly get you up and running while allowing multiple resources to be added later on. Moreover, as you’ll see later in this tutorial, this entire solution costs less than $1 per month.

Prerequisites

- A GCP project to test this code
- Permissions to use the services mentioned above
- Bash with the Google Cloud SDK (gcloud command) installed (you can also use Cloud Shell)

Building Zuna

We’ll chop this up into multiple steps:

- create some resources and attach labels to them
- make sure we have permissions to list and remove the resources
- craft a script that detects and removes the resources
- make the script executable in GCP
- trigger the script periodically
- optional cleanup

Step 1: Create Resources

First, we create a topic and a subscription so we have something to clean up.
We’ll attach the label autodelete: true, so our script can automatically detect which resources are up for removal:

    #!/usr/bin/env bash
    GCP_PROJECT_ID=$(gcloud config get-value project)
    LOCATION=EU

    # Create Pub/Sub topics
    gcloud pubsub topics create \
        test-zuna-topic \
        --labels=department=engineering

    # Create Pub/Sub subscription that should be removed
    gcloud pubsub subscriptions create \
        test-zuna-subscription \
        --topic=test-zuna-topic \
        --labels=department=engineering,autodelete=true

    # Create Pub/Sub subscription that should NOT be removed
    gcloud pubsub subscriptions create \
        test-zuna-subscription-dontremove \
        --topic=test-zuna-topic \
        --labels=department=engineering

When you list the resources, you should see the labels:

    $ gcloud pubsub subscriptions describe test-zuna-subscription
    ackDeadlineSeconds: 10
    expirationPolicy:
      ttl: 2678400s
    labels:
      autodelete: 'true'
      department: engineering
    messageRetentionDuration: 604800s
    name: projects/jonnys-project-304716/subscriptions/test-zuna-subscription
    pushConfig: {}
    topic: projects/jonnys-project-304716/topics/test-zuna-topic

When you go to the cloud console, you should see the label appear on your newly created Pub/Sub subscription:

Figure: Labels on subscriptions in the Cloud Console.

Alright, we now have a resource that is up for deletion! When working with real resources, you can either label them manually or let your resource provisioning script take care of this. Next up: making sure we have permissions to delete these resources.

Step 2: Get Permission

To facilitate development later on, it’s best to work with a Service Account from the get-go.
This account will be bound to your script when it executes and will provide it with the correct permissions to manage (in our case, delete) the resources.#!/usr/bin/env bashGCP_PROJECT_ID=$(gcloud config get-value project)gcloud iam service-accounts create sa-zuna gcloud iam service-accounts keys create sa-key.json --iam-account=sa-zuna@${GCP_PROJECT_ID}.iam.gserviceaccount.com These commands create a service account that lives in your project (identified by sa-zuna@<your-project-id>.iam.gserviceaccount.com). Next, it crafts a public-private key pair of which the private part is downloaded into the file sa-key.json. This file can now be used to authenticate your script, as we will see in the next section.First, let’s make sure that we have the correct permissions to list and remove subscriptions. Create the following file called zuna-role-definition.yaml:title: "ZUNA"description: "Permissions for ZUNA."stage: "ALPHA"includedPermissions:- pubsub.subscriptions.list- pubsub.subscriptions.deleteNext, execute the following script:#!/usr/bin/env bashGCP_PROJECT_ID=$(gcloud config get-value project)# Create a role specifically for ZUNA inside the projectgcloud iam roles create zuna --project=${GCP_PROJECT_ID} \ --file=zuna-role-definition.yaml# Bind the role to the ZUNA SA on a project levelgcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} \ --member="serviceAccount:sa-zuna@${GCP_PROJECT_ID}.iam.gserviceaccount.com" \ --role="projects/${GCP_PROJECT_ID}/roles/zuna"The script creates a new role, specifically for our application ZUNA, with the two permissions we need. The role definition lives inside our project (this is important when referencing the role). Next, the role is assigned to the service account on a project level. This means that the permissions apply to all the subscription resources that live inside our project.Note: It’s also possible to assign pre-defined roles to the service account. 
Still, we opt for a specific custom role as the pre-defined ones would grant unnecessary permissions, e.g., consuming from a subscription. This way of working is in line with the principle of least privilege.

Step 3: Delete Resources in Python

It is time to remove our resource using a Python script! You can quickly set up a Python 3 virtual environment as follows:

    #!/usr/bin/env bash
    virtualenv -p python3 venv
    source venv/bin/activate
    pip install google-cloud-pubsub==2.4.0

Now you can create a Python file clean_subscriptions.py with the following contents:

    import os

    from google.cloud import pubsub_v1


    def clean_pubsub_subscriptions(project_id, delete=False):
        # Subscriptions
        # see: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/efe5e78451c59415a7dcaaf72db77b13085cfa51/pubsub/cloud-client/subscriber.py#L43
        client = pubsub_v1.SubscriberClient()
        project_path = f"projects/{project_id}"
        to_delete = []

        # Go over ALL subscriptions in the project
        for subscription in client.list_subscriptions(request={"project": project_path}):
            # Collect those with the correct label
            # (.get() tolerates subscriptions that have no autodelete label)
            if subscription.labels.get('autodelete') == 'true':
                print(f'Subscription {subscription.name} is up for removal')
                to_delete.append(subscription.name)
            else:
                print(f'Skipping subscription {subscription.name} (not tagged for removal)')

        # Remove subscriptions if needed
        if delete:
            for subscription_name in to_delete:
                print(f'Removing {subscription_name}...')
                client.delete_subscription(subscription=subscription_name)
            print(f'Removed {len(to_delete)} subscriptions')
        else:
            print(f'Skipping removal of {len(to_delete)} subscriptions')

        client.close()


    if __name__ == "__main__":
        project_id = os.environ['GCP_PROJECT_ID']
        clean_pubsub_subscriptions(project_id, False)

Conceptually, the following happens:

- the script uses your project id to fetch all the subscriptions in the project
- it keeps only the subscriptions that carry the label autodelete: true
- it attempts to remove all these subscriptions

Note that the actual removal is still disabled for safety reasons. You can enable it by setting the last line to clean_pubsub_subscriptions(project_id, True).

You can run the script as follows:

    #!/usr/bin/env bash
    source venv/bin/activate
    export GCP_PROJECT_ID=$(gcloud config get-value project)
    export GOOGLE_APPLICATION_CREDENTIALS=sa-key.json
    python clean_subscriptions.py

Because we make use of Google’s Python client library, we can pass in our service account using the GOOGLE_APPLICATION_CREDENTIALS environment variable. The script will then automatically inherit the roles/permissions we assigned to the service account.

The output of the script should resemble the following:

    Skipping subscription projects/jonnys-project-304716/subscriptions/test-zuna-subscription-dontremove (not tagged for removal)
    Subscription projects/jonnys-project-304716/subscriptions/test-zuna-subscription is up for removal
    Skipping removal of 1 subscriptions

That’s correct: only one of our two subscriptions is up for removal. Now let’s move this to GCP!

Step 4: Wrap in a Cloud Function

We can easily wrap the previous section’s script in a Cloud Function. A Cloud Function is a piece of code that can be triggered using an HTTP endpoint or a Pub/Sub message. We’ll choose the latter as Cloud Scheduler can directly post messages to Pub/Sub: an ideal combination!

    import base64
    import os

    from clean_subscriptions import clean_pubsub_subscriptions


    # see: https://cloud.google.com/functions/docs/tutorials/pubsub#functions-prepare-environment-python
    def app_zuna(event, context):
        """
        Background Cloud Function to be triggered by Pub/Sub.
        Args:
            event (dict): The dictionary with data specific to this type of
                event. The `data` field contains the PubsubMessage message. The
                `attributes` field will contain custom attributes if there are any.
            context (google.cloud.functions.Context): The Cloud Functions event
                metadata. The `event_id` field contains the Pub/Sub message ID. The
                `timestamp` field contains the publish time.
        """
        print("""This Function was triggered by messageId {} published at {}
        """.format(context.event_id, context.timestamp))

        if 'data' in event:
            payload = base64.b64decode(event['data']).decode('utf-8')
        else:
            payload = 'N/A'
        print('ZUNA started with payload "{}"!'.format(payload))
        run_cleanup_steps()


    def run_cleanup_steps():
        project_id = os.environ['GCP_PROJECT_ID']
        print("ZUNA project:", project_id)

        print("Cleaning Pub/Sub Subscriptions...")
        clean_pubsub_subscriptions(project_id=project_id, delete=True)
        print("Pub/Sub Subscriptions cleaned!")

        # TODO Clean-up Pub/Sub Topics
        # TODO Clean-up BigQuery Datasets


    if __name__ == "__main__":
        run_cleanup_steps()

This code should be placed in main.py and is a simple wrapper for our function. You can test it locally by running python main.py. You'll notice from the output that our function from the previous step is executed; we also reserved some space for future resources (the # TODO lines).

The additional function app_zuna will be the Cloud Function's entry point. Currently, it just prints the payload it receives from Pub/Sub and subsequently calls the cleanup function.
This makes it behave similarly to the local execution.

Deploying can be done with the following script:

    #!/usr/bin/env bash
    GCP_PROJECT_ID=$(gcloud config get-value project)
    source venv/bin/activate

    # Deploy the function
    gcloud functions deploy \
      app-zuna \
      --entry-point=app_zuna \
      --region=europe-west1 \
      --runtime python38 \
      --service-account=sa-zuna@${GCP_PROJECT_ID}.iam.gserviceaccount.com \
      --trigger-topic app-zuna-cloudscheduler \
      --set-env-vars GCP_PROJECT_ID=${GCP_PROJECT_ID} \
      --timeout=540s \
      --quiet

Note: You might want to change the region to something more suitable for your situation.

Several important highlights:

- we refer to the Python function app_zuna to make sure this function is called when the Cloud Function is hit
- the service account we created earlier is used to execute the Cloud Function; this means that when our code runs in the cloud, it inherits the permissions assigned to the service account!
- the trigger topic refers to a Pub/Sub topic that the Cloud Function will “listen” to; whenever a message appears on there, the function will process it; the topic is created automatically
- the environment variable for the project is included so both local and remote (cloud) versions can operate identically
- the timeout is set to 9 minutes, which is the maximum at the time of writing; we set it this high as removal might take some time in the current non-parallel setup

When you run this script, gcloud will package up your local resources and send them to the cloud. Note that you can exclude specific resources using the .gcloudignore file, which is created when you run the command the first time. When the upload completes, a Cloud Function instance is created that will run your code for every message that appears in the Pub/Sub topic.

Note: In case you get an error that resembles "Cloud Functions API has not been used in project ... before or it is disabled", you still need to enable the Cloud Functions API in your project. This can easily be done with the following commands or using the Cloud Console (see this documentation):

    gcloud services enable cloudfunctions.googleapis.com
    gcloud services enable cloudbuild.googleapis.com

You can easily test the Cloud Function by sending a message to the newly created Pub/Sub topic:

    #!/usr/bin/env bash
    gcloud pubsub topics publish app-zuna-cloudscheduler \
      --message="Hello ZUNA!"

Or in the Cloud Console using the “Publish Message” option directly on the topic:

(Figure: Send data to the Pub/Sub topic to test the Cloud Function.)

You can view the logs using gcloud:

    $ gcloud functions logs read app-zuna --region=europe-west1
    D  app-zuna  3brcusf1xgve  2021-02-28 20:37:12.444  Function execution started
       app-zuna  3brcusf1xgve  2021-02-28 20:37:12.458  This Function was triggered by messageId 2041279672282325 published at 2021-02-28T20:37:10.501Z
       app-zuna  3brcusf1xgve  2021-02-28 20:37:12.458
       app-zuna  3brcusf1xgve  2021-02-28 20:37:12.458  ZUNA started with payload "Hello ZUNA"!
       app-zuna  3brcusf1xgve  2021-02-28 20:37:12.458  ZUNA project: jonnys-project-304716
       app-zuna  3brcusf1xgve  2021-02-28 20:37:12.458  Cleaning PubSub Subscriptions...
       app-zuna  3brcusf1xgve  2021-02-28 20:37:13.139  Skipping subscription projects/jonnys-project-304716/subscriptions/test-zuna-subscription-dontremove (not tagged for removal)
       app-zuna  3brcusf1xgve  2021-02-28 20:37:13.139  Subscription projects/jonnys-project-304716/subscriptions/test-zuna-subscription is up for removal
       app-zuna  3brcusf1xgve  2021-02-28 20:37:13.139  Skipping subscription projects/jonnys-project-304716/subscriptions/gcf-app-zuna-europe-west1-app-zuna-cloudscheduler (not tagged for removal)
       app-zuna  3brcusf1xgve  2021-02-28 20:37:13.139  Removing projects/jonnys-project-304716/subscriptions/test-zuna-subscription...
       app-zuna  3brcusf1xgve  2021-02-28 20:37:17.869  Removed 1 subscriptions
       app-zuna  3brcusf1xgve  2021-02-28 20:37:17.945  PubSub Subscriptions cleaned!
    D  app-zuna  3brcusf1xgve  2021-02-28 20:37:17.946  Function execution took 5505 ms, finished with status: 'ok'
    D  app-zuna  9cbqexeyaie8  2021-02-28 20:38:49.068  Function execution started
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.147  [2021-02-28 20:38:50 +0000] [1] [INFO] Starting gunicorn 20.0.4
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.148  [2021-02-28 20:38:50 +0000] [1] [INFO] Listening at: (1)
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.148  [2021-02-28 20:38:50 +0000] [1] [INFO] Using worker: threads
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.202  [2021-02-28 20:38:50 +0000] [6] [INFO] Booting worker with pid: 6
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.232  This Function was triggered by messageId 2041279609298300 published at 2021-02-28T20:38:48.348Z
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.233
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.233  ZUNA started with payload "Test ZUNA using Cloud Console"!
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.233  ZUNA project: jonnys-project-304716
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.233  Cleaning PubSub Subscriptions...
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.851  Skipping subscription projects/jonnys-project-304716/subscriptions/test-zuna-subscription-dontremove (not tagged for removal)
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.851  Skipping subscription projects/jonnys-project-304716/subscriptions/gcf-app-zuna-europe-west1-app-zuna-cloudscheduler (not tagged for removal)
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.851  Removed 0 subscriptions
       app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.852  PubSub Subscriptions cleaned!
    D  app-zuna  9cbqexeyaie8  2021-02-28 20:38:50.854  Function execution took 1787 ms, finished with status: 'ok'

Or in the Cloud Console:

(Figure: Logs of our newly deployed Cloud Function.)

Notice how our print statements appear in the output: the Pub/Sub message payload is logged, as well as the informational messages about which subscriptions have been deleted.

Step 5: Automate the Process

We now have a fully functioning cleanup system; the only missing piece is automation. For this, we employ Cloud Scheduler, which is a managed cron service, for those of you who are familiar with cron.

    #!/usr/bin/env bash
    gcloud scheduler jobs create pubsub \
      zuna-weekly \
      --time-zone="Europe/Brussels" \
      --schedule="0 22 * * 5" \
      --topic=app-zuna-cloudscheduler \
      --message-body "Go Zuna! (source: Cloud Scheduler job: zuna-weekly)"

In this script, we create a new scheduled “job” that will publish the specified message to the Pub/Sub topic that our Cloud Function is listening to.

Note that the time zone is set specifically for my use case. Omitting this would make it default to Etc/UTC. Hence, you might want to change this to accommodate your needs. The TZ database names on this page should be used.

When creating the job, you might get a message that your project does not have an App Engine app yet.
You should create one before continuing², but make sure you choose the correct region.

Your output of the Cloud Scheduler job creation should look like this:

    name: projects/jonnys-project-304716/locations/europe-west1/jobs/zuna-weekly
    pubsubTarget:
      data: R28gWnVuYSEgKHNvdXJjZTogQ2xvdWQgU2NoZWR1bGVyIGpvYjogenVuYS13ZWVrbHkp
      topicName: projects/jonnys-project-304716/topics/app-zuna-cloudscheduler
    retryConfig:
      maxBackoffDuration: 3600s
      maxDoublings: 16
      maxRetryDuration: 0s
      minBackoffDuration: 5s
    schedule: 0 22 * * 5
    state: ENABLED
    timeZone: Europe/Brussels
    userUpdateTime: '2021-03-07T12:46:38Z'

Every Friday at 22:00 (Europe/Brussels), this scheduler will trigger. We also get the additional benefit of manual triggering. Option 1 is the following gcloud command:

    #!/usr/bin/env bash
    gcloud scheduler jobs run zuna-weekly

Option 2 is via the UI, where we get a nice RUN NOW button:

(Figure: Our Cloud Scheduler job, complete with a “RUN NOW” button.)

Both options are great when you’d like to perform that occasional manual cleanup. After execution, you should see the new output in your Cloud Function’s logs.

Warning: Triggering a cleanup during a test run (e.g., integration tests) of your system might fail your tests unexpectedly. Moreover, Friday evening might not make sense for your setup if it can break things. You don’t want to get a weekly alert when you’re enjoying your evening. Be careful!

Step 6: Cleanup

Well, when you’re done testing this, you should clean up, right?
The following script contains the gcloud commands to clean up the resources that were created above:

    #!/usr/bin/env bash
    GCP_PROJECT_ID=$(gcloud config get-value project)

    # Remove virtual environment
    rm -rf venv

    # Remove Pub/Sub subscriptions
    gcloud pubsub subscriptions delete test-zuna-subscription
    gcloud pubsub subscriptions delete test-zuna-subscription-dontremove

    # Remove Pub/Sub topics
    gcloud pubsub topics delete test-zuna-topic

    # Cloud Scheduler
    gcloud scheduler jobs delete zuna-weekly --quiet

    # Cloud Function
    gcloud functions delete app-zuna --region=europe-west1 --quiet

    # Roles
    gcloud iam roles delete zuna --project=${GCP_PROJECT_ID}

    # Service Account
    gcloud iam service-accounts delete sa-zuna@${GCP_PROJECT_ID}.iam.gserviceaccount.com

Warning: Some of the resources, such as service accounts or custom roles, might cause issues when you re-create them soon after deletion as their internal counterpart is not immediately deleted.

That’s it. You’re all done now! Go and enjoy the time you gained, or continue reading to find out how much this setup will cost you.

Pricing

At the beginning of this tutorial, we stated that we chose these specific services to help keep the costs low. Let’s investigate the cost model to verify this is the case.

- Cloud Scheduler has three free jobs per month per billing account ($0.10 per additional job). [source]
- Cloud Functions has 2M free invocations per billing account ($0.40 per 1M additional invocations + resource cost). We would have around four function calls per month for our use case, which is negligible. Note that at the time of writing, the default memory of a Cloud Function is set to 256MB, which we can tune down to 128MB using the deployment flag --memory=128. This adjustment will make every invocation even cheaper. [source]
- The first 10 gigabytes for Cloud Pub/Sub are free ($40 per additional TiB). Even though messages are counted as 1000 bytes minimum, we are still in the ballpark of a few kilobytes per month. So again, a negligible cost. [source]

Hence, even for a setup where the free tiers are not applicable anymore, we don’t expect a cost that is higher than $0.20 per month.

Next Steps

Adding more resource types would definitely be useful. Check out the ZUNA repository for hooks to clean up Dataflow jobs, Pub/Sub topics, BigQuery datasets & tables, etc.

You could check when certain resources were created and build in an expiration time. This will significantly reduce the risk of interfering with test runs. It’s also good to know that some services have expiration built-in!³

Terraforming this setup is also a good idea. In that case, it could automatically be part of your deployment pipeline.

Conclusion

We’ve set up a Cloud Function that scans your current project for resources labeled with autodelete: true and removes them. The Cloud Function only has limited permissions and is triggered periodically using Cloud Scheduler and a Pub/Sub topic.

We succeeded in building an automatic system that we can also trigger manually. It’s flexible as we can easily add code to clean up other types of resources. It was quick to set up, especially since we kept the bash scripts (although terraforming it would be nice). The cost is low as the service usage is minimal, and all services use a pay-as-you-go model. Using this setup will probably keep you in the free tier.

Finally, since the components that we used are generic, the resulting setup translates directly into a helpful blueprint for countless other automation use cases.

¹ Eliminating Toil, Site Reliability Engineering: https://sre.google/sre-book/eliminating-toil/
² It seems Cloud Scheduler is still tied to App Engine, hence the requirement to have an App Engine app.
³ Indeed, check out Pub/Sub Subscription expiration and BigQuery table expiration!

Originally published at https://connectingdots.xyz on March 28, 2021.

Categories: Data Analytics, DevOps and SRE, Serverless

Tips for Building Better Machine Learning Models

Photo by Ivan Aleksic on Unsplash

Hello, developers! If you have worked on building deep neural networks, you know that it can involve a lot of experimentation. In this article, I will share some tips and guidelines that I find useful and that you can use to build better deep learning models, making it a lot more efficient for you to arrive at a good network.

You may need to choose which of these tips are helpful in your scenario, but any of the ideas mentioned in this article could directly improve your models’ performance.

A High-Level Approach for Hyperparameter Tuning

One of the painful things about training deep neural networks is the many hyperparameters you have to deal with constantly. These could be your learning rate α, the discounting factor ρ, and epsilon ε if you are using the RMSprop optimizer (Hinton et al.), or the exponential decay rates β₁ and β₂ if you are using the Adam optimizer (Kingma et al.). You also need to choose the number of layers in the network and the number of hidden units per layer; you might be using learning rate schedulers that need configuring, and a lot more! We need ways to organize our hyperparameter tuning process better.

The algorithm I usually use to organize my hyperparameter search is random search. Though improvements to this algorithm exist, I typically end up using plain random search. Let’s say, for this example, you want to tune two hyperparameters, and you suspect that the optimal values for both lie somewhere between one and five. The idea is that, instead of systematically picking 25 grid values to try out [like (1, 1), (1, 2), etc.], it is more effective to select 25 points at random.
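As a rough illustration of the idea above, here is a minimal, framework-free sketch of random search over two hyperparameters between one and five. The function toy_objective is a hypothetical stand-in for "train a model and return its validation loss"; the names, the fixed seed, and the toy loss surface are assumptions made for this sketch and are not taken from the TensorFlow example referenced in this article.

```python
import random


def random_search(objective, n_trials=25, low=1.0, high=5.0, seed=0):
    """Sample n_trials random points in [low, high] x [low, high], keep the best."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best_point, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw both hyperparameters independently and uniformly at random
        point = (rng.uniform(low, high), rng.uniform(low, high))
        score = objective(*point)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score


def toy_objective(a, b):
    # Hypothetical validation loss with its minimum at (2.5, 3.5);
    # in practice this would train and evaluate a model.
    return (a - 2.5) ** 2 + (b - 3.5) ** 2


best_point, best_score = random_search(toy_objective)
print("best hyperparameters:", best_point)
print("best (toy) loss:", best_score)
```

In a real search, the body of the objective would build and train the network with the sampled values, so each of the 25 trials costs one training run, exactly as with a 5 x 5 grid.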
(Figure: based on lecture notes of Andrew Ng.)

Here is a simple example with TensorFlow where I use random search on the Fashion-MNIST dataset for the learning rate and the number of units:

(Embedded example: Random Search in TensorFlow.)

I will not cover the intuition behind this in this article. However, you can read about it in this article I wrote some time back.

Use Mixed Precision Training for Large Networks

Growing the size of a neural network usually results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. With mixed-precision training, according to Paulius Micikevicius and colleagues, the idea is to train deep neural networks using half-precision floating-point numbers, so that large networks train faster with no or negligible decrease in performance. However, I would like to point out that this technique should be used only for large models with more than 100 million parameters.

While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs (for example, Tesla V100 and Tesla T4) and on Cloud TPUs. To give you an idea of the performance gains: when I trained a ResNet model on my Google Cloud Platform Notebook instance (with a Tesla V100), I saw a speedup of almost 3 times in training time, and almost 1.5 times on a Cloud TPU instance, with next to no difference in accuracy. To further increase your training throughput, you could also consider using a larger batch size (since we are using float16 tensors, you should not run out of memory).

It is also rather easy to implement mixed precision with TensorFlow; the code used to measure the above speedups was taken from this example.
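To make the memory argument above concrete, here is a small sketch that uses only the Python standard library (it is not TensorFlow code and not the example referenced above). It compares the storage size of half-precision and single-precision floats and shows the reduced precision and range of float16, which is why frameworks keep some values in float32 and call the technique "mixed" precision. The helper to_float16 and the tensor size are hypothetical choices made for illustration.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


bytes_fp16 = struct.calcsize("<e")  # size of one half-precision float
bytes_fp32 = struct.calcsize("<f")  # size of one single-precision float

n_values = 1_000_000  # hypothetical number of activations in a tensor
print("float32 storage:", n_values * bytes_fp32, "bytes")
print("float16 storage:", n_values * bytes_fp16, "bytes")

# float16 keeps only about three significant decimal digits
print("0.1 stored as float16:", to_float16(0.1))

# Very small values (such as tiny gradients) underflow
print("1e-8 stored as float16:", to_float16(1e-8))  # rounds to 0.0
```

Halving the bytes per value is what frees memory for a larger batch size, while the underflow on the last line hints at why not everything can safely be stored in half precision.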
If you are looking for more inspiration to use mixed-precision training, here is an image demonstrating speedup for multiple models by Google Cloud on a TPU:

(Figure: Speedups on a Cloud TPU.)

Use Grad Check for Backpropagation

In multiple scenarios, I have had to implement a neural network from scratch, and implementing backpropagation is usually the part most prone to mistakes and the most difficult to debug. With an incorrect backpropagation, your model may even learn something that looks reasonable, which makes the bug harder still to find. So, how cool would it be if we could implement something that allows us to debug our neural nets easily?

I often use gradient checks when implementing backpropagation to help me debug it. The idea is to approximate the gradients numerically. If the approximation is close to the gradients calculated by the backpropagation algorithm, you can be more confident that backpropagation was implemented correctly.

You can use the following expression to get a vector, which we will call dθ[approx]:

(Figure: Calculate approx gradients. The expression was an image in the original post; the usual choice is the two-sided difference dθ[approx][i] = (J(θ₁, …, θᵢ + ε, …) − J(θ₁, …, θᵢ − ε, …)) / (2ε) for a small ε.)

In case you are looking for the intuition behind this, you can find more about it in this article by me. So, now we have two vectors, dθ[approx] and dθ (calculated by backprop), and these should be almost equal to each other. You can simply compute the Euclidean distance between these two vectors and use this reference table to help you debug your nets:

(Figure: Reference Table.)

Cache Datasets

Caching datasets is a simple idea, but one I have not seen used much. The idea is to go over the dataset in its entirety and cache it either in a file or in memory (if it is a small dataset). This saves you from repeating expensive CPU operations, like opening files and reading data, during every single epoch.
This also means that your first epoch will take comparatively more time, since that is when you perform all the operations like opening files and reading data, and then cache the results. The subsequent epochs should be a lot faster since they use the cached data.

This is indeed a very simple idea to implement! Here is an example with TensorFlow showing how easily one can cache datasets, along with the speedup from doing so:

(Embedded example: Caching Datasets and the Speedup with It.)

Common Approaches to Tackle Overfitting

If you have worked on building neural networks, overfitting or underfitting is arguably one of the most common problems you face. This section talks about some common approaches that I usually use when tackling these problems. You probably know this, but high bias will cause us to miss a relation between features and labels (underfitting), and high variance will cause the model to capture noise and overfit to the training data.

I believe the most promising way to solve overfitting is to get more data, though you could also augment data. A benefit of very deep neural networks is that their performance continues to improve as they are fed larger and larger datasets. However, in a lot of situations, it might be too expensive (or simply infeasible) to get more data, so let’s talk about a couple of methods you could use to tackle overfitting.

Broadly, there are two ways to do this: changing the architecture of the network, or applying some modifications to the network’s weights. A simple way to change the architecture so that it doesn’t overfit is to use random search to arrive at a good architecture, or to try pruning nodes from your model.
We already talked about random search, but in case you want to see an example of pruning, you could take a look at the TensorFlow Model Optimization Pruning Guide.

Some common regularization methods I tend to try out are:

- Dropout: Randomly remove x% of the inputs to a layer.
- L2 Regularization: Force weights to be small, reducing the possibility of overfitting.
- Early Stopping: Stop the model training when performance on the validation set starts to degrade.

Thank You

Thank you for sticking with me till the end. I hope you will benefit from this article and incorporate these tips in your own experiments. I am excited to see if they help improve the performance of your neural nets, too. If you have any feedback or suggestions for me, please feel free to send them over! @rishit_dagli

Categories: AI and Machine Learning, Data Analytics, DevOps and SRE

C2C Deep Dive: Demystifying Anthos with Google Cloud's Director of Outbound Product Management

Originally published on December 4, 2020.

In this C2C Deep Dive, product expert Richard Seroter aimed to build the foundations of understanding with live Q&A. Here’s what you need to know:

What is Anthos?

In its simplest form, Anthos is a managed platform that extends Google Cloud services and engineering practices to your environments so you can modernize apps faster and establish operational consistency across platforms.

Why GCP and Anthos for app modernization?

Responding to an industry shift and need, Google Anthos “allows you to bring your computing closer to your data,” Seroter said. So if data centers are “centers of data,” it's helpful to have access to that data in an open, straightforward, portable way and to be able to do that at scale and consistently. Hear Seroter explain how this can help you consolidate your workloads.

First-generation vs. second-generation cloud-native companies: What have we learned?

The first generation was all about infrastructure automation and a continuous delivery (CD) mindset at a time when there wasn’t much research into how to make it happen. Challenges included configuration management, dealing with multiple platforms, and dealing with security.

Now, as Richard Seroter explains in this clip, the second generation is taking what has been learned and building upon it for sustainable scaling with a focus on applications.

Is Unified Hub Management possible through Anthos for the new generation?

Yep. Anthos offers a single management experience, so you can manage every Anthos cluster in one place, see what they’re doing, and push policy back to them, too. You can apply configurations and more, which simplifies billing and management.

Serverless anywhere?

You bet. Use Cloud Run for Anthos. Building upon the first generation of platform as a service (PaaS), GCP brings Cloud Run for Anthos as a solution for those who need more flexibility and want to build on a modern stack.
Besides being Richard Seroter’s favorite, it balances the three vital paradigms existing today: PaaS, infrastructure as a service (IaaS), and container as a service (CaaS). Watch the clip to hear Seroter explain the how and the why.

What about a GitOps workflow and automation: is scaling possible?

Yes. With Anthos Configuration Management (ACM), policy and configuration management are possible at scale. You can manage all cloud infrastructure, not just Kubernetes apps and clusters, and even run end-to-end audits and peer review. Watch to learn how this works.

Question from the community: Are capabilities such as Hybrid AI and BigQuery available for Anthos on-prem?

With Hybrid AI for Anthos, Google offers AI/ML training and inferencing capabilities with a single click. Google Anthos also allows for custom AI model training and MLOps lifecycle management using virtually any deep-learning framework.

Prefer to watch the whole C2C Deep Dive on Application Development with Anthos?

Categories: AI and Machine Learning, Application Development, Infrastructure, Getting Started with Google Cloud, DevOps and SRE, Hybrid and Multicloud, Cloud Operations, Serverless, Session Recording

5 Takeaways from C2C's Conversation with Google’s Pali Bhat

This article was originally published on October 28, 2020.

In the fourth installment of its Rockstar Conversations series, C2C welcomed Pali Bhat, VP of product and design at Google, to discuss application modernization and building for the future. Bhat, who has been at Google for more than 10 years, has held various roles but none closer to his heart than his current one. “I love application development!” Bhat said. Among his responsibilities, Bhat is in charge of Google’s application modernization and developer solutions portfolio. He came ready to discuss all of it, including Anthos, Google Kubernetes Engine (GKE), and various other hot-topic issues, with C2C and its members.

“As you think about your applications, you’ll see they’re the heart of your business and how you serve customers,” Bhat said. “They will become more germane and central to everything that your business does. And so, it's really important to have a platform that empowers all of your technology and application development teams to be proactive and to not have to worry about infrastructure, while still being secure and compliant and meeting the needs of your business.”

Here are five key takeaways from the engaging conversation that took place on the C2C Rockstar stage.

How Does Google Cloud Help Customers Solve Problems?

Bhat explained his role at Google as one that needs to determine customer problems and then find ways to solve them. “I listen to what our customers are saying,” he noted. “And then I get to design solutions and bring them to market, and this is something I really enjoy doing.”

In the past year, as the pandemic has made waves across industries and borders, Google has seen an increase in the number of customers looking to accelerate their move to the cloud and to modernize their applications and processes.
To give you a sense of Google’s scale, in his keynote during Google Next 2020, Bhat noted that Google has deployed more than 4 million builds and 500 million test cases daily.

“Those numbers are clearly impressive,” Bhat said during the Rockstar Conversation. “But this is an example of how customers are solving problems by taking advantage of the unprecedented scale that Google Cloud offers them.” Bhat offered an example with the GKE platform, which is Google’s secured and managed Kubernetes service with four-way auto scaling and multi-cluster support. “We have it at three times the scale of any other open-source Kubernetes, and by far the largest scale of any cloud-based Kubernetes service,” he noted. “Today, within each Kubernetes cluster on GKE, we support more than 15,000 nodes. What that means is that customers can analyze all of their data, which they could not do manually, at scale. These are the kinds of solutions we can bring to customers.”

How Is Google Cloud Helping Developers Develop?

Bhat stressed that a top priority for him and his team at Google Cloud is to empower developers to be able to move faster and focus on the applications they’re building without having to worry about the infrastructure.

“We’ve all seen how development has evolved over the last 30, 40 years,” he said. “We’ve seen an increased focus on more tools and more framework. But with the complexity that has also grown in terms of business requirements, as well as what our customers need from us and from each of the applications we build, we've seen an increased focus on security and compliance as well.” As a result, Google Cloud is focused on providing developers the tools they need for security and compliance without slowing them down.

“Google Cloud brought to the industry through Kubernetes a declarative model for managing how you build applications and how you deploy them,” Bhat said.
“And what that ultimately gives you as a developer is the ability to operate at scale.” Another example Bhat highlighted is the work his team is doing on compliance and policy. “Today, the way you find out if you're in compliance is by doing an audit after the fact,” he said. “We’re going to work on automating all those things that have to do with security and compliance so that developers can simply focus on building their applications.”

As Bhat looks to the future of development, he said it’s about getting to a point where you can have the cloud be your pair programmer. “When the cloud is your pair programmer, you get a really powerful ally to help you go faster and to be able to focus on exactly what your customers need, and not have to worry about all these other things.”

Are You Ready to Go to Google CAMP?

Bhat introduced the Google Cloud Application Modernization Program (Google CAMP) during his keynote address at Google Next 2020. He said, “The path to elite performance is not about implementing a one-size-fits-all app modernization model. Rather, you need to make incremental, continuous improvement based on your organization’s needs. To be successful, you need to start with a deep understanding of where you stand and where to focus so you can maximize the impact of your modernization effort.”

Google CAMP has three main components to help organizations become elite performers: data-driven baseline assessment and benchmarking; a blueprint for proven, scalable DevSecOps practices; and a modern, yet extensible platform that enables you to build, run, secure, and manage both legacy and new applications.

During the Rockstar Conversation, Bhat said he has heard customers continuously ask for help in their application modernization journeys. “Google CAMP gives every one of our customers the ability to get a tailored application modernization program that meets their business needs. 
It is designed to help them go faster while also doing it more safely.” He added, “You often hear people say to go fast and break things. Well, our customers actually don’t want to break things. They’re powering some of the most critical and essential services on our planet, and we can help them move fast, safely.”

What Makes Anthos Stand Out as an Application Platform?

In a one-on-one discussion I had with Bhat, he told me that Anthos will help bring the benefits of the cloud to all customers, whether they are on-premises, at the edge, or in the cloud, and regardless of which cloud they are on.

During the Rockstar Conversation with C2C, he was asked what makes Anthos stand out as a top-tier application platform. He pointed to three distinct things. As he had mentioned to me, he noted that first and foremost it brings the cloud to customers. “Anthos is really the first solution that actually brings all of the power of cloud in a consumable way to wherever you need it.”

The second thing that separates Anthos is its foundation of open technologies, with managed services on top. “We've used a portfolio of open solutions such as Kubernetes and others that we've done within Google, and then we’ve built incredible managed services on top that simplify security, policy, governance, and compliance, so that you can actually continue to just focus on your application,” Bhat said.

The third thing that sets Anthos apart from other application platforms is that it is the first hybrid multi-cloud platform that’s bringing value-added services on top. “As an example,” Bhat said, “we recently announced BigQuery Omni, which will bring BigQuery, our data analytics service, to other clouds.” Bhat added, “Those three things put together give you the application platform that you need.”

Where Does the Application Modernization Storyline Go from Here?

Bhat was asked: if application modernization were a 10-episode Netflix series, what episode would we be on? 
Without hesitation, Bhat responded, “episode three!” He explained, “I look at it as customers have gotten started and they really understand that this is important. They've dug into this series, and they now want to get to the finish line. They want to see how it turns out.”

In terms of what’s up next, Bhat cautioned that, “ultimately, application modernization is something that has to be part of what you do as an organization because every technology evolves and gets better over time.” He added, “this can’t just be a 10-episode series, but rather something so popular that it’s going to get renewed, and we’re going to have many seasons.”

As for advice he wanted to share, Bhat said that it is really important to pick the right platform so that you can actually evolve as your customer and business needs evolve. “You also need to pick the right platform so that your teams are investing in the technologies that are going to be durable. And finally, remember that although having the right set of platforms, technologies, and policies are critically important, the most important piece in all of this are your people. Celebrate your technology and application development teams because without the people, none of this would be possible.”

Categories: Application Development, Getting Started with Google Cloud, Google Cloud Strategy, DevOps and SRE, Hybrid and Multicloud

One-on-One with Google Cloud's Pali Bhat

This article was originally published on October 15, 2020.

Pali Bhat is VP of Product and Design at Google. As part of the leadership team at Google Cloud, he has had the opportunity to lead products that have had a significant impact for Google Cloud customers.

Today, he is responsible for the cloud company’s application modernization and developer solutions portfolio. Specifically, Bhat is in charge of Anthos, Google Cloud’s hybrid and multi-cloud platform, as well as the company’s modern computing efforts, which include Google Kubernetes Engine and Google Cloud’s serverless solutions, Cloud Run and Cloud Functions. Bhat also leads Google Cloud’s Developer Solutions, Cloud Code and Cloud Build. Although Bhat has had his hand in many of Google’s flagship products, he recently noted that the area he’s working on now, application modernization, is the most exciting for him. “When we look at application modernization, it's a journey that virtually every large enterprise around the world is currently underway with,” he said. “The application modernization path they are on is unique in a number of ways for each customer because each has different needs, and the starting points for many are quite different.” He added, “We’re very focused on ensuring that our solutions actually address the needs of all of our customers, while allowing them to bring along their own assets, regardless of where they are in their journey.”

Bringing the Value and Building for the Future

As developers code software and applications to solve business problems for the future, they’re faced with the reality that today the cloud is still too complex a platform to develop for. “They’re not just developing an application,” Bhat said. “They need to then deploy and operate that application as well. 
That process can be complex.”

Bhat and his team are focusing on taking a different approach so that the solution they provide acts almost like an autopilot and does the hard work, whether it's infrastructure-related, deploying an application, or pushing it to production and managing it there. “With Google Cloud taking care of this aspect of the solution, technology teams can instead focus solely on innovating and on the act of creating business value for their customers.”

Looking forward, he noted the value will be in bringing the benefits of the cloud to all customers, whether they are on-premises, at the edge, or in the cloud, and regardless of which cloud they are on. “That’s what we’re focused on now with Anthos,” he said. “And what I believe will happen in the next few years is that the act of developing an application and then delivering it to your customers will get much simpler, while providing a more robust environment to manage the application after deployment.”

Democratizing at Scale Benefits Everyone

Like others in Google Cloud’s leadership (e.g., Andrew Moore), Bhat also believes in the power of democratizing applications. When asked what he’s most passionate about solving, he answered simply, “The biggest opportunity for cloud customers is to both leverage large-scale computing, storage, and network resources, and also to democratize access to value-added capabilities such as data analytics and machine learning, among others.”

By doing this, Bhat emphasized, customers will not only be able to innovate faster, but also be as operationally efficient as they can be. “The number one thing that keeps me up at night,” Bhat noted, “is how to get these capabilities into our customers’ hands quickly. It’s still too hard for customers to adopt cloud, and making this simpler is my priority.”

A Call to Unite and Work Together

Bhat sits on the board for C2C, which is the independent Google Cloud community. 
It’s not that he doesn’t already have a full plate in his role at Google Cloud, but he sees this as an opportunity to connect and work with customers and better understand how to make the solutions he creates more accessible and approachable.

“When I think of C2C, there is of course the community aspect of it where customers can network with and help each other,” he said. “But it can do more than that, too. As cloud gets more broadly adopted, I see an opportunity where our customers can take advantage of all of the different services that the broader community is producing in order for all of us to get better together. And I see C2C as not just a place where customers can learn, but be a place where customers actually find other services that help them build the things that they want to deliver for their customers and their business.”

Categories: Application Development, Careers in Cloud, DevOps and SRE, Hybrid and Multicloud, Interview