LLama2 70b on GC | C2C Community

LLama2 70b on GC

  • 26 September 2023
  • 8 replies

Userlevel 3
Badge +2

Hi Community, I am hoping you can help me out. I recently read that I can easily use vortex to launch Llama and it shouldn’t be difficult. Has this been others experience? I was recently told that it would be a long process to fine tune Llama2 70b but keep reading other articles of people making it happen in matter of hours. Any experience with this? Also, how can I figure out the hourly costs to host Llama2 on GC?

I appreciate any feedback you can offer.





Best answer by malamin 28 September 2023, 02:37

View original

8 replies

Userlevel 7
Badge +17

To figure out the hourly costs of hosting a specific model on Google Cloud (GC), you would need to consider various factors, such as the instance type, storage requirements, and network usage. GC provides a pricing calculator on their website which can help you estimate the costs based on your specific configuration. You can access the Google Cloud pricing calculator at https://cloud.google.com/products/calculator

Userlevel 3
Badge +2

Thanks Kate on your quick response. Any thoughts on my other questions?

Userlevel 7
Badge +29

@Nelson, you were trying to work around a bit with Llama2… Can you perhaps drop a piece of advice here? Thanks! 

Userlevel 7
Badge +35

Hello @ChambiarGirl ,

Thank you for open this discusstion here. It’s really interesting and Fine-tuning Llama2 70b in a matter of hours is still a challenge, but it is becoming increasingly possible with the latest advances in hardware and software. For example, the Hugging Face Transformers library provides a number of techniques for efficient fine-tuning, such as QLoRA and FlashAttention 2.

One way to fine-tune Llama2 70b in a matter of hours is to use a technique called QLoRA, which stands for Low-Rank Adaptation of Large Language Models. QLoRA works by attaching small, low-rank adapters to the pre-trained model. These adapters are then fine-tuned on the downstream task, which is much faster than fine-tuning the entire model.

Another way to fine-tune Llama2 70b in a matter of hours is to use a technique called parameter-efficient fine-tuning. Parameter-efficient fine-tuning works by only fine-tuning the parameters of the model that are most important for the downstream task. This can be done by using techniques such as prompt tuning and adapter tuning.

Here are some tips for fine-tuning Llama2 70b in a matter of hours in Google Cloud Vertex AI:

  • Google Cloud offers a variety of hardware options, including GPUs and TPUs. GPUs are a good option for training large language models, but TPUs are even faster and more efficient.
  • Cloud TPUs: Cloud TPUs are specialized hardware accelerators that are designed for training machine learning models. They can significantly speed up the training process of large models like the Llama2 70b model
  • Use multi-worker training. Multi-worker training allows you to train a model on multiple machines simultaneously, which can significantly reduce the training time. Vertex AI makes it easy to set up multi-worker training jobs, and you can choose the number of workers to use based on your budget and the resources you have available.
  • Use a pre-processed dataset. This will save time on data loading and pre-processing.
  • Use a low learning rate. This will help to prevent the model from overfitting the training data.
  • Use a shorter training schedule. You may not need to train the model for as long as you would think in order to achieve good results.
  • Use a distributed training framework PyTorch FSDP or TensorFlow's MultiWorkerMirroredStrategy and Horovod, allow you to train your model on multiple GPUs or TPUs in parallel. This can significantly reduce the training time of large language models. This will allow you to split the training data across multiple TPUs, which can significantly speed up the training process and reduce the cost of training.
  • Use a checkpointing strategy. This will allow you to save the state of the model at regular intervals so that you can resume training from where you left off if the training process fails for any reason. This can save you time and money, as you will not have to start the training process over from the beginning.
  • Use a mixed precision training strategy. This will allow you to use a lower precision for some of the computations, which can improve the performance of the training process without sacrificing too much accuracy. This can save you money, as you will not need to use as many TPUs to train the model.
  • Use pre-trained weights. If you are able to find pre-trained weights for the Llama2 70b model, you can use them to initialize your model. This can significantly reduce the training time and cost.
  • Choose the right TPU pod. Google Cloud offers a variety of TPU pods that you can use to train the Llama2 70b model. Some TPU pods are more cost-effective than others. For example, the Cloud TPU v4 pod is more cost-effective than the Cloud TPU v3 pod.

Please note:

  • Consider using Cloud TPU Pod Sharing. Cloud TPU Pod Sharing allows you to share a Cloud TPU pod with other users. This can help to reduce the cost of training your model.
  • Use Cloud TPU preemptible VMs. Cloud TPU preemptible VMs are Cloud TPU VMs that can be taken away from you if someone else needs them. However, Cloud TPU preemptible VMs are significantly cheaper than Cloud TPU reserved VMs. If you can afford to have your training process interrupted, then Cloud TPU preemptible VMs can be a good way to save money.
  • Use Cloud TPU custom machine learning (CML) images. Cloud TPU CML images are custom machine learning images that you can create and use to train your model. Cloud TPU CML images can help you to improve the performance and cost-effectiveness of your training process.

If you need help with training or deploying the Llama2 70b model in Google Cloud TPU, you can contact the Google Cloud support team. Also, Read the following blog post and watch the introductory video and episode video , efficient fine tuning to learn more about how to optimize your model training for faster results.


To figure out the hourly costs to host Llama2 on Google Cloud , you will need to basically consider the following factors:

  • Model size: Larger models will typically cost more to host than smaller models.
  • Model complexity: More complex models will also typically cost more to host than simpler models.
  • Prediction volume: The number of predictions that you make per hour will also affect your costs.
  • Region: The region where you host your model will also affect your costs.

Once you have considered these factors, you can use the pricing calculator and Vertex AI pricing calculator to estimate your hourly costs.

Here is an example of how to use the pricing calculator to estimate the hourly costs to host Llama2 on Vertex AI:

Model size: 10 GB
Model complexity: Medium
Prediction volume: 100,000 predictions per hour
Region: US-central1


I hope this information helps you make the right decision.

Userlevel 3
Badge +2

@malamin I have one follow up question and hoping you can help. 
Currently I am hosting llama 2 7b on vertex ai. I am calling the endpoint through postman with the following json as the body:
“instances”: [
        {“prompt”:  “hello world” }

I am getting a good response from the model, but I am unable to find good documentation for what other parameters I can include in the body. For example, how would I add a system prompt?

Userlevel 7
Badge +35

Hello @ChambiarGirl,

Thank you for the follow up. To include a system prompt, you can update the JSON body as follows:


  instances: [
      prompt: hello world,
      inputs: {
        system: This is the system prompt.



"instances": [
"prompt": "hello world",
"system": "This is a system prompt."


import requests

# Replace the following values with your own values.
ENDPOINT_ID = "your endpoint"
REGION = "us-central1"
PROJECT = "my-project"

# Create the request body.
request_body = {
  "instances": [
      "systemParameters": {
        "project": PROJECT,
        "region": REGION
      "inputs": {
        "text": "This is a test."

# Make the request to the Vertex endpoint.
response = requests.post(f"https://{REGION}-aiplatform.googleapis.com/v1/endpoints/{ENDPOINT_ID}/predict", json=request_body)

# Check the response status code.
if response.status_code == 200:
  # The request was successful.
  # The request failed.

Also, could you check the following link for the details documention, it might be help you about how can you pass others additional parameters into end point request.






Userlevel 3
Badge +2

@malamin As always, your information was super helpful. Thank you so much for all of your help. This is all new to my team and I so we are figuring things as we go. We’ve recently come up with the issue of how to do unsupervised learning for Llama2 chat on Vertex AI.  It looks like only supervised and RLHF learning are supported on Vertex AI. To do unsupervised learning, would I just split the unstructured text at random locations into input/output pairs? I can't find much direction for this anywhere online, so if you know of any resources I would appreciate it.

Userlevel 7
Badge +35

@ChambiarGirl, Thank you for your compliment. You are correct that Vertex AI currently only supports supervised and RLHF learning for custom training. However, there are a few ways to do unsupervised learning on Vertex AI:

AutoML Tables can be used to train unsupervised models for tasks such as anomaly detection and clustering. To do this, you simply need to provide AutoML Tables with your unstructured text data and specify the task that you want to perform. AutoML Tables will then automatically train a model for you and provide you with the results.

Model Garden provides a variety of pre-trained unsupervised learning models that you can deploy on Vertex AI. These models can be used for tasks such as text clustering, topic modeling, and anomaly detection. To deploy a Model Garden model to Vertex AI, you can use the Vertex AI UI or the Vertex AI SDK.

You can also use custom training to train your own unsupervised learning models on Vertex AI. To do this, you will need to write your own training code and choose a suitable ML framework. Once you have trained your model, you can deploy it to Vertex AI using the Vertex AI SDK.

To send unsupervised learning requests to your Vertex AI custom prediction endpoint, you can use the following Python code:


import requests

# Set the endpoint URL
ENDPOINT_URL = "https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/endpoints/YOUR_ENDPOINT_ID"

# Set the unsupervised learning request body
requestBody = {
"inputs": [
"text": "This is an example of an unsupervised learning request."

# Send the request
response = requests.post(ENDPOINT_URL, json=requestBody)

# Get the response body
responseBody = response.json()

# Print the response


Using unsupervised learning with Llama2 Chat on Vertex AI, you can follow these steps:

  • Create a Vertex AI custom training job.
  • Select the "Custom container" training option.
  • Upload your Llama2 Chat model to Vertex AI.
  • Create a Python script that implements the unsupervised learning algorithm you want to use.
  • Submit your training job.

Once your training job is complete, you can deploy your unsupervised learning model to Vertex AI Prediction.

Here is an example of a Python script that implements the K-means clustering algorithm:

import numpy as np
from sklearn.cluster import KMeans

class KMeansClusterer:
def __init__(self, n_clusters):
self.kmeans = KMeans(n_clusters=n_clusters)

def fit(self, X):

def predict(self, X):
return self.kmeans.predict(X)

# Load the Llama2 Chat model
model = Llama2Chat.from_pretrained("google/llama2-7b-chat-ggml")

# Create a KMeans clusterer
kmeans = KMeansClusterer(n_clusters=3)

# Fit the clusterer to the Llama2 Chat embeddings
embeddings = model.encode(corpus)

# Predict the cluster labels for new data points
new_embeddings = model.encode(["Hello, world!", "I am a large language model.", "What is the meaning of life?"])
cluster_labels = kmeans.predict(new_embeddings)

# Print the cluster labels


If you are new to unsupervised learning, I recommend that you start with AutoML Tables or Model Garden. These options are relatively easy to use and can help you to get started with unsupervised learning quickly.

Also, I will higly recommend you to go through Generative AI on Google Cloud free training and Google Cloud qwiklabs arcade monthly program without any cost and you will receive special Google Cloud Swags. That will give you practical hand on knowledge based on latest Google Cloud AI implementaiton.


Here are some additional resources that you may find helpful:

I hope this information will helpful for you.