January 23, 2023
Global Vertex AI COE lead
Vertex AI Outbound Product Manager
Secure and enable Vertex AI platform as your end to end ML/AI platform for production workloads.
An increasing number of Enterprise customers are adopting ML/AI as their core transformational pillars, in order to differentiate, increase revenue, reduce costs and maximize efficiency. For many customers ML/AI adoption can be a challenging endeavor not only because of the broad spectrum of applications ML/AI can support, deciding on which one to prioritize can be a challenge, but because moving these solutions into production require a series of security, access and data assessments and features that some ML/AI platforms might not have. This blog post focuses on how to set up your Cloud foundations to cater specifically to the Vertex AI platform and its configuration to be able to set up proper Vertex AI foundations for your future machine learning operations (MLOps) and ML/AI use cases.
Explainability is not covered in this blog post, but as a practitioner it is one of the key components for any production ready ML system to take it into account. You can take a look at Vertex Explainable AI for a more in depth approach on feature based explanations, feature attributions methods (Sampled Shapley, Integrated methods and XRAI) and differentiable and non-differentiable models.
Vertex AI currently comprises more than 22 services, so for simplicity we will cover the five core services to get your end-to-end machine learning process enabled and enterprise ready:
Vertex AI Workbench
Vertex AI Feature Store
Vertex AI Training
Vertex AI Prediction
Vertex AI Pipelines
Vertex AI reference Enterprise Networking Architecture
One of the key components is to understand how you should establish your development, user acceptance testing/Quality (UAT/QA) and Production environments. It's clear that as you move from one environment to another you will want to restrict external access and automate as much as possible. It's at this point when ML/AI starts to become very similar in the way it publishes code into production to the software development lifecycle. If you are familiar with DevOps Research and Assessment (DORA) or Development Operations (DevOps) and Machine Learning Operations (MLOPs) you can see how continuous integration and continuous delivery are applicable across both frameworks to ship and build code continuously, securely and reliably.
Vertex AI Machine Learning Operations
Machine learning operations borrow many elements from DevOps when it comes to ensuring an automated and reliable way of shipping software across multiple environments. Different companies might be in a different stage of their MLOps journey however, according to many research studies like the one from IDC, much of the Return On Investment (ROI) for ML/AI projects lies in moving into production.
DevOps is a popular practice in developing and operating large-scale software systems. This practice provides benefits such as shortening the development cycles, increasing deployment velocity, and dependable releases. To achieve these benefits, you introduce two concepts in the software system development:
An ML system is a software system, so similar practices apply to help guarantee that you can reliably build and operate ML systems at scale. However it is important to understand the difference between DevOps and MLOps (“MLOps: Continuous delivery and automation pipelines in machine learning | Cloud Architecture Center”):
Team skills: In an ML project, the team usually includes data scientists or ML researchers, who focus on exploratory data analysis, model development, and experimentation. These members might not be experienced software engineers who can build production-class services.
Development: ML is experimental in nature. You should try different features, algorithms, modeling techniques, and parameter configurations to find what works best for the problem as quickly as possible. The challenge is tracking what worked and what didn't, and maintaining reproducibility while maximizing code reusability.
Testing: Testing an ML system is more involved than testing other software systems. In addition to typical unit and integration tests, you need data validation, trained model quality evaluation, and model validation.
Deployment: In ML systems, deployment isn't as simple as deploying an offline-trained ML model as a prediction service. ML systems can require you to deploy a multi-step pipeline to automatically retrain and deploy models. This pipeline adds complexity and requires you to automate steps that are manually done before deployment by data scientists to train and validate new models.
Production: ML models can have reduced performance not only due to suboptimal coding, but also due to constantly evolving data profiles. In other words, models can decay in more ways than conventional software systems, and you need to consider this degradation. Therefore, you need to track summary statistics of your data and monitor the online performance of your model to send notifications or roll back when values deviate from your expectations.
Vertex AI Workbench - User Managed
User-Managed Notebooks are Deep Learning VM Images instances with JupyterLab notebook environments enabled and ready for use. This page describes how to upgrade the environment of a user-managed notebooks instance. When running a Jupyter notebook in Google Cloud on a User-Managed Notebook, your instance runs on a virtual machine (VM) managed by Vertex AI Workbench. From the Jupyter Notebook, you can access BigQuery and Google Cloud Storage data. For added security, you can run a Shielded VM as your computer instance for Workbench Notebooks. Log streaming to the consumer project via Logs Viewer is supported. You can also include the notebooks API to any legacy VM you might have following this guide.
All data at rest in GCP is encrypted. By default, this encryption will use a Google-managed key, though for greater control some customers use a Customer-Managed Encryption Key (CMEK), or even provide the key from their own Hardware Security Module (HSM) using our External Key Manager. Data in transit is also encrypted by default using TLS.
Additionally, we recommend you use Google Cloud Client Libraries for your application. Google Cloud Client Libraries use a library called Application Default Credentials (ADC) to automatically find your service account credentials. Another recommendation is that you use custom service accounts instead of predefined or basic roles.
Vertex AI Feature Store
Vertex AI Feature Store provides a centralized repository for organizing, storing, and serving ML features. Using a central feature store enables an organization to efficiently share, discover, and re-use ML features at scale, which can increase the velocity of developing and deploying new ML applications.