I'm starting to work with Google Vertex and I had this question regarding the scalability of the endpoints.
When I configure an endpoint I must:
1) set the hardware for the machine that serves the model (n1-standard-2, for example);
2) set a minimum and maximum number of compute nodes.
Number 1 is mandatory and number 2 is optional (if configured, new nodes will only be instantiated with the configuration defined in item 1).
What is unclear to me is when I should invest in a larger hardware configuration (1) and when I should invest in a larger number of nodes (2).
Does anyone have an idea?
Best answer by lmm
@Yuri Matelli Calazans Luz! I hope I can share some thoughts to help you here :)
The choice between machine type and number of nodes is similar to the analysis you do when putting a regular application into production - you must find out by experiment the minimum hardware required to run your application, and you must also consider how the infrastructure should behave when your application usage increases (what we usually call "autoscaling").
Going back to Vertex endpoints - when you publish a model to an endpoint, you are basically using a mechanism that loads your model, exposes a REST API, receives inference requests as REST API calls, runs the inference on the trained model and delivers the inference result in a REST API response.
So first you need to consider your model size and complexity - what are the minimum hardware requirements that give my users the best response time for inferences? This analysis will guide you in deciding the minimum resources needed to run your model (and in picking an initial machine configuration).
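One way to ground the machine-type decision is to time the model on a candidate machine before choosing. The snippet below is only a minimal sketch: `measure_latency` and `fake_predict` are hypothetical names, and `fake_predict` stands in for whatever call actually runs your model.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def measure_latency(predict: Callable[[object], object],
                    samples: Sequence[object],
                    warmup: int = 3) -> dict:
    """Time `predict` on each sample and summarize latency in milliseconds."""
    if not samples:
        raise ValueError("samples must not be empty")

    # Warm-up calls so one-off costs (lazy loading, caches) do not skew timings.
    for sample in samples[:warmup]:
        predict(sample)

    timings_ms = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    # Nearest-rank 95th percentile of the sorted timings.
    p95_index = max(0, math.ceil(0.95 * len(timings_ms)) - 1)
    return {
        "count": len(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def fake_predict(x):
    # Hypothetical stand-in for a real model call.
    return x * 2


stats = measure_latency(fake_predict, list(range(20)))
print(stats)
```

Running the same measurement on two or three machine types should show where a larger machine stops improving response time, which is a reasonable point to stop scaling up and start thinking about node count instead.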
Secondly, you must analyze how your model is used - do you know its usage pattern? Can you predict (pun not intended) when (times of day, days of the week, weeks of the month) it will be used most, or is the traffic more chaotic? With that in mind, you will need to weigh model availability against your cost plan:
The ideal scenario is to keep the better-configured machines as busy as possible (in % of resources used, avoiding idle machines) most of the time. If you can't predict the load, the minimum and maximum number of nodes can help: you define the minimum number of servers that must always be running the endpoint (let's say, one server), and additional nodes are spun up as needed, up to the maximum (when the first server reaches its maximum capacity, a second is spun up; then a third later, and so on). When traffic decreases, the extra machines are shut down one by one until the endpoint is back at the minimum number of nodes.
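The scale-out and scale-in behavior described above can be sketched in a few lines. This is a simplified model and not Vertex AI's actual autoscaling algorithm: the function name, the requests-per-second capacity figure and the traffic numbers are all made up for illustration.

```python
import math


def nodes_needed(requests_per_second: float,
                 capacity_per_node: float,
                 min_nodes: int = 1,
                 max_nodes: int = 3) -> int:
    """Return how many nodes a simple scale-out policy would keep running.

    Assumption: every node handles `capacity_per_node` requests per second,
    and the node count is always clamped to [min_nodes, max_nodes].
    """
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("need 1 <= min_nodes <= max_nodes")

    wanted = math.ceil(requests_per_second / capacity_per_node)
    return max(min_nodes, min(max_nodes, wanted))


# Hypothetical traffic over six time slots, in requests per second.
traffic = [5, 40, 120, 260, 90, 10]
plan = [nodes_needed(rps, capacity_per_node=100) for rps in traffic]
print(plan)  # [1, 1, 2, 3, 1, 1]
```

Note that once traffic exceeds what the maximum number of nodes can serve, this model simply stays at the maximum. That is the point where a larger machine type or a higher maximum is worth considering.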
A nice side effect is that you can keep analyzing your environment usage and keep tweaking those knobs to find the best configuration for your production environment.
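To make the knobs concrete, here is a small sketch that keeps the three settings together and validates them. `EndpointScaling` is a hypothetical helper. The field names are chosen to mirror the `machine_type`, `min_replica_count` and `max_replica_count` arguments of `Model.deploy` in the `google-cloud-aiplatform` Python SDK, but nothing here calls the SDK, so please check the names against the version you use.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EndpointScaling:
    """Hypothetical container for the endpoint knobs discussed in this thread."""
    machine_type: str
    min_replica_count: int = 1
    max_replica_count: int = 1

    def __post_init__(self):
        # Reject combinations that could never describe a valid node range.
        if self.min_replica_count < 1:
            raise ValueError("min_replica_count must be at least 1")
        if self.max_replica_count < self.min_replica_count:
            raise ValueError("max_replica_count must be >= min_replica_count")

    def deploy_kwargs(self) -> dict:
        """Return the settings as keyword arguments for a deploy call."""
        return asdict(self)


config = EndpointScaling("n1-standard-2", min_replica_count=1, max_replica_count=3)
kwargs = config.deploy_kwargs()
print(kwargs)
# Assumption, not executed here: with the SDK installed and authenticated,
# these could be passed along as model.deploy(**kwargs).
```

Keeping the settings in one place like this makes it easier to record which configuration was live while you were watching usage, and to change one knob at a time.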