
Optimize PyTorch training performance with Reduction Server on Vertex AI

Categories: AI and Machine Learning

March 16, 2023


Nikita Namjoshi

Developer Advocate

Eric Dong

Developer Programs Engineer


As deep learning models become increasingly complex and datasets larger, distributed training is all but a necessity. Faster training makes for faster iteration to reach your modeling goals. But distributed training comes with its own set of challenges.

On top of deciding what kind of distribution strategy you want to use and making changes to your training code, you need a way to manage infrastructure, optimize usage of your accelerators, and deal with limited bandwidth between nodes. This added complexity can slow your progress.

In this post, we’ll show you how to speed up training of a PyTorch + Hugging Face model using Reduction Server, a Vertex AI feature that optimizes bandwidth and latency of multi-node distributed training on NVIDIA GPUs for synchronous data parallel algorithms.


Overview of Distributed Data Parallel


Before diving into the details of Reduction Server and how to submit jobs on the Vertex AI training service, it’s useful to understand the basics of distributed data parallelism. Data parallelism is just one way of performing distributed training and can be used when you have multiple accelerators on a single machine, or multiple machines each with multiple accelerators.

To get an understanding of how data parallelism works, let’s start with a linear model. We can think of this model in terms of its computational graph. In the image below, the matmul op takes in the X and W tensors, which are the training batch and weights respectively. The resulting tensor is then passed to the add op with the tensor b, which is the model’s bias terms. The result of this op is Ypred, which is the model’s predictions.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_PyTorch_training.0995065416410811.max-2000x2000.jpg
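
Written out in PyTorch, the same graph is just a couple of ops. The tensor shapes below are arbitrary and chosen only for illustration; this is simply a sketch of the matmul and add ops described above.

    import torch

    X = torch.randn(32, 10)          # training batch: 32 examples, 10 features
    W = torch.randn(10, 1)           # model weights
    b = torch.zeros(1)               # bias terms

    # matmul op followed by add op, as in the computational graph above
    Ypred = torch.matmul(X, W) + b   # model predictions, shape (32, 1)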

We want a way of executing this computational graph such that we can leverage multiple workers. One way we might do this is by splitting the input batch X in half, and sending one slice to GPU 0 and the other to GPU 1. In this case, each GPU worker calculates the same ops but on different slices of the data.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_PyTorch_training.1060064018780877.max-2000x2000.jpg

Adding this second worker allows us to double the effective batch size. Each GPU gets a separate slice of the data, computes the gradients on it, and these gradients are averaged. So if each GPU processes a batch of 32, your effective batch size becomes 64 with two GPUs and 128 with four. By adding more GPUs, your model sees more data on each training step, which means it takes less time to finish an epoch, that is, a full pass through the training data. This is the core idea of data parallelism.
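
As a minimal single-host sketch of that two-GPU split, assuming an effective batch of 64 and using torch.chunk purely for illustration (in a real job each worker loads its own shard of the data rather than receiving a slice from one process):

    import torch

    global_batch = torch.randn(64, 10)   # effective batch of 64 across 2 workers
    slices = global_batch.chunk(2)       # 32 examples per GPU worker

    for rank, data_slice in enumerate(slices):
        device = f"cuda:{rank}" if torch.cuda.device_count() > rank else "cpu"
        data_slice = data_slice.to(device)
        # each worker runs the same ops (matmul + add) on its own slice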

But we've glossed over a key detail here: because the two workers compute gradients on different slices of the data, they end up with different gradients. So at the end of the backward pass, we now have two different sets of gradients.

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_PyTorch_training.1000066120000920.max-2000x2000.jpg

When you're doing synchronous data parallel training, you want to take these multiple sets of gradients and turn them into one set. This is done by averaging the gradients in a process known as AllReduce, and the optimizer then uses these averaged gradients to update the model weights.

To compute the average, each worker needs to know the values of the gradients computed by all the other workers, and we want to pass this information between nodes using as little bandwidth and time as possible. There are many algorithms for implementing this aggregation efficiently, such as Ring AllReduce and various tree-based algorithms. On Vertex AI, you can use Reduction Server, which optimizes bandwidth and latency of multi-node distributed training on NVIDIA GPUs for synchronous data parallel algorithms.
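
Conceptually, the averaging step is a single collective call per gradient tensor. The sketch below shows the idea with torch.distributed, assuming a process group has already been initialized; in practice DDP issues the equivalent NCCL calls for you, so you would not normally write this yourself.

    import torch
    import torch.distributed as dist

    def average_gradients(model: torch.nn.Module) -> None:
        """All-reduce each parameter's gradient and divide by the world size."""
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size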

To summarize, a distributed data parallel setup works as follows:

  • Each worker device performs the forward pass on a different slice of the input data to compute the loss.

  • Each worker device computes the gradients based on the loss function.

  • These gradients are aggregated (reduced) across all of the devices.

  • The optimizer updates the weights using the reduced gradients, thereby keeping the devices in sync. 
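
Putting the four steps together, a minimal PyTorch DDP training loop looks roughly like the sketch below. The model, dataset, and hyperparameters are placeholders, and the environment variables (RANK, WORLD_SIZE, LOCAL_RANK) are assumed to be set by the launcher, for example torchrun or the Vertex AI training service.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    def train():
        dist.init_process_group(backend="nccl")      # NCCL performs the all-reduce
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # placeholder dataset and model
        dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
        sampler = DistributedSampler(dataset)        # each worker sees a different slice
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        model = DDP(torch.nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for epoch in range(3):
            sampler.set_epoch(epoch)
            for X, y in loader:
                X, y = X.cuda(local_rank), y.cuda(local_rank)
                loss = loss_fn(model(X), y)          # forward pass on this worker's slice
                optimizer.zero_grad()
                loss.backward()                      # gradients are all-reduced here
                optimizer.step()                     # identical update on every worker

        dist.destroy_process_group()

    if __name__ == "__main__":
        train()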


Vertex AI Reduction Server


Note that while data parallelism can be used to speed up training across multiple devices on a single machine, or multiple machines in a cluster, Reduction Server works specifically in the latter case.

Vertex Reduction Server introduces an additional worker role, a reducer. Reducers are dedicated to one function only: aggregating gradients from workers. And because of their limited functionality, reducers don’t require a lot of computational power and can run on relatively inexpensive compute nodes.

The following diagram shows a cluster with four GPU workers and five reducers. GPU workers maintain model replicas, calculate gradients, and update parameters. Reducers receive blocks of gradients from the GPU workers, reduce the blocks and redistribute the reduced blocks back to the GPU workers.

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_PyTorch_training.1000063620001115.max-2000x2000.jpg

To perform the all-reduce operation, the gradient array on each GPU worker is first partitioned into M blocks, where M is the number of reducers. A given reducer processes the same partition of the gradient from all GPU workers. For example, as shown on the above diagram, the first reducer reduces the blocks a0 through a3 and the second reducer reduces the blocks b0 through b3. After reducing the received blocks, a reducer sends back the reduced partition to all GPU workers.

If the size of a gradient array is K bytes, each node in the topology sends and receives K bytes of data. That is almost half the data that ring- and tree-based all-reduce implementations exchange. An additional advantage of Reduction Server is that its latency does not depend on the number of workers.
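
To see where the "almost half" figure comes from, the quick back-of-the-envelope comparison below uses the well-known per-node traffic of ring all-reduce, roughly 2K(n - 1)/n bytes, versus the K bytes per GPU worker with Reduction Server; the gradient size and worker counts are arbitrary examples.

    K = 400e6                             # gradient size in bytes (illustrative, ~100M fp32 parameters)
    for n in (2, 4, 8, 16):
        ring_bytes = 2 * K * (n - 1) / n  # sent (and received) per node with ring all-reduce
        rs_bytes = K                      # sent (and received) per GPU worker with Reduction Server
        print(f"{n:>2} workers: ring {ring_bytes / 1e6:.0f} MB vs Reduction Server {rs_bytes / 1e6:.0f} MB")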

https://storage.googleapis.com/gweb-cloudblog-publish/images/5_PyTorch_training.1000064320000611.max-2000x2000.jpg

Using Reduction Server with PyTorch

Reduction Server can be used with any distributed training framework that uses the NVIDIA NCCL library for the all-reduce collective operation. You do not need to change or recompile your training application.

In the case of PyTorch, you could use the DistributedDataParallel (DDP) or FullyShardedDataParallel (FSDP) distributed training strategies. Once you’ve made the necessary changes to your PyTorch training code, you can leverage Reduction Server by:

  • Installing the Reduction Server NVIDIA NCCL transport plugin in your training container image.

  • Configuring a Vertex AI Training custom job that includes a Reduction Server worker pool.


Installing the Reduction Server NVIDIA NCCL transport plugin


Reduction Server is implemented as an NVIDIA NCCL transport plugin. This plugin must be installed on the container image that is used to run your training application. The plugin is included in the Vertex AI pre-built training containers.
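
The first of the two steps above is handled for you when you use a pre-built training container. For the second step, a custom job that adds a dedicated Reduction Server worker pool can be described with the Vertex AI Python SDK roughly as in the sketch below. The machine types, replica counts, region, project, and training image URI are placeholders to adapt to your own job, and the Reduction Server container URI shown here should be verified against the current Vertex AI documentation.

    from google.cloud import aiplatform

    TRAIN_IMAGE = "us-central1-docker.pkg.dev/YOUR_PROJECT/your-repo/pytorch-train:latest"  # placeholder

    worker_pool_specs = [
        # Pool 0: the primary GPU worker (chief)
        {
            "machine_spec": {
                "machine_type": "n1-standard-16",
                "accelerator_type": "NVIDIA_TESLA_V100",
                "accelerator_count": 2,
            },
            "replica_count": 1,
            "container_spec": {"image_uri": TRAIN_IMAGE},
        },
        # Pool 1: the remaining GPU workers
        {
            "machine_spec": {
                "machine_type": "n1-standard-16",
                "accelerator_type": "NVIDIA_TESLA_V100",
                "accelerator_count": 2,
            },
            "replica_count": 3,
            "container_spec": {"image_uri": TRAIN_IMAGE},
        },
        # Pool 2: Reduction Server replicas on CPU-only machines
        {
            "machine_spec": {"machine_type": "n1-highcpu-16"},
            "replica_count": 4,
            "container_spec": {
                # Reduction Server image published for Vertex AI; confirm the current URI in the docs
                "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest",
            },
        },
    ]

    aiplatform.init(project="YOUR_PROJECT", location="us-central1")  # placeholders
    job = aiplatform.CustomJob(
        display_name="pytorch-reduction-server-job",
        worker_pool_specs=worker_pool_specs,
    )
    job.run()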

