
Multi-GPU Training Paradigms

Writer: Anastasia Karavdina


I found the overview of training paradigms discussed by Sebastian Raschka in his book "Machine Learning Q and AI" quite interesting. Below is a short summary. Check out the book if you would like to learn more and understand them better!


Model Parallelism

Model parallelism distributes different parts of a large model across multiple GPUs, processing each section sequentially and passing intermediate results between devices. This approach is useful for models too large for a single GPU but requires careful coordination. For example, a simple two-layer neural network can have each layer on a separate GPU, and this can be scaled to more layers and GPUs.
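To make the two-layer example concrete, here is a minimal PyTorch sketch, assuming two GPUs (cuda:0 and cuda:1) and arbitrary layer sizes. The "model parallel" part is simply that each layer lives on its own device and the intermediate activation is copied between them.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Minimal model parallelism: layer 1 lives on cuda:0, layer 2 on cuda:1."""
    def __init__(self, in_dim=1024, hidden=2048, out_dim=10):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden).to("cuda:0")
        self.layer2 = nn.Linear(hidden, out_dim).to("cuda:1")

    def forward(self, x):
        # Layer 1 runs on GPU 0, then its activation is copied to GPU 1,
        # where layer 2 runs. The two GPUs work sequentially, not concurrently.
        h = torch.relu(self.layer1(x.to("cuda:0")))
        return self.layer2(h.to("cuda:1"))

model = TwoGPUModel()
x = torch.randn(32, 1024)
out = model(x)
print(out.device)   # cuda:1
```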


Data Parallelism

Data parallelism divides a minibatch into microbatches, with each GPU processing one microbatch independently. Each GPU calculates the loss and gradients for its microbatch, and the gradients are then combined to update the model. This method allows GPUs to operate concurrently but requires a full model copy on each GPU, which can be limiting if the model is too large.
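Below is a rough sketch of how this looks with PyTorch's DistributedDataParallel (DDP): every rank holds a full model copy, processes its own microbatch, and gradients are averaged across ranks during the backward pass. The model, dimensions, and data are placeholders.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Typically launched with `torchrun --nproc_per_node=<num_gpus> script.py`,
    # which sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Every GPU holds a full copy of the model.
    model = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(), nn.Linear(2048, 10)).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Each rank sees its own microbatch; DDP averages gradients across ranks
    # during backward(), so the weight update is identical on every GPU.
    x = torch.randn(32, 1024).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```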


Tensor Parallelism

Tensor parallelism splits weight and activation matrices across GPUs, dividing operations like matrix multiplications for more efficient parallel processing. This method overcomes memory limitations and improves parallelism compared to model parallelism. However, it can involve significant communication overhead due to frequent synchronization between GPUs.
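The idea can be illustrated without any framework support. Below is a toy column-parallel linear layer that splits the weight matrix across two GPUs along its output dimension and concatenates the partial results; in a real setup the gather step would be a collective (all-gather) rather than a plain device-to-device copy, and the dimensions here are made up.

```python
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Toy tensor parallelism: the weight matrix is split column-wise,
    so each GPU computes half of the output features."""
    def __init__(self, in_dim=1024, out_dim=2048):
        super().__init__()
        assert out_dim % 2 == 0
        self.shard0 = nn.Linear(in_dim, out_dim // 2).to("cuda:0")
        self.shard1 = nn.Linear(in_dim, out_dim // 2).to("cuda:1")

    def forward(self, x):
        # The input is broadcast to both GPUs; each computes its slice of columns.
        y0 = self.shard0(x.to("cuda:0"))
        y1 = self.shard1(x.to("cuda:1"))
        # Gather: here a simple copy + concat stands in for an all-gather.
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)

layer = ColumnParallelLinear()
x = torch.randn(32, 1024)
print(layer(x).shape)   # torch.Size([32, 2048])
```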

Pipeline Parallelism

Pipeline parallelism passes activations forward and gradients backward between devices, combining aspects of data and model parallelism to minimize idle time. It improves utilization across layers, but some idle time remains and the implementation is more complex. Its performance benefits can also be smaller than those of data parallelism, especially for small models or when communication overhead is high.
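A simplified, forward-only sketch of the idea (real GPipe-style schedules also interleave the backward pass): the minibatch is split into microbatches that stream through two stages on two GPUs, so the second GPU can work on one microbatch while the first GPU starts the next. Stage sizes and the microbatch count are arbitrary.

```python
import torch
import torch.nn as nn

# Two pipeline stages on two GPUs (hypothetical layer sizes).
stage0 = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU()).to("cuda:0")
stage1 = nn.Linear(2048, 10).to("cuda:1")

def pipelined_forward(batch, num_microbatches=4):
    # Split the minibatch into microbatches. Because CUDA kernel launches are
    # asynchronous and each GPU has its own stream, GPU 0 can start on
    # microbatch i+1 while GPU 1 still processes microbatch i, shrinking the
    # idle "bubble" of plain model parallelism.
    outputs = []
    for micro in batch.chunk(num_microbatches):
        h = stage0(micro.to("cuda:0"))
        outputs.append(stage1(h.to("cuda:1", non_blocking=True)))
    return torch.cat(outputs)

x = torch.randn(64, 1024)
print(pipelined_forward(x).shape)   # torch.Size([64, 10])
```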


Sequence Parallelism

Sequence parallelism addresses the quadratic scaling of the self-attention mechanism in transformers by splitting long sequences into smaller chunks distributed across GPUs. This reduces memory constraints but introduces additional communication overhead and requires model duplication across devices. Splitting sequences can also impact accuracy, particularly with longer inputs.
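A toy sketch, assuming two GPUs and only the per-token feed-forward part of a transformer block: the block is replicated on each device (model duplication) and the sequence dimension is split in half. The attention part, which would require exchanging keys and values across GPUs, is deliberately omitted here.

```python
import copy
import torch
import torch.nn as nn

# Per-token feed-forward block, replicated on both GPUs.
block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block0 = copy.deepcopy(block).to("cuda:0")
block1 = copy.deepcopy(block).to("cuda:1")

def sequence_parallel_forward(x):
    # x: (batch, seq_len, hidden). Splitting along the sequence dimension means
    # each GPU only holds activations for half of the tokens.
    chunk0, chunk1 = x.chunk(2, dim=1)
    y0 = block0(chunk0.to("cuda:0"))
    y1 = block1(chunk1.to("cuda:1"))
    # Self-attention would additionally need cross-GPU communication of
    # keys/values; that step is omitted in this sketch.
    return torch.cat([y0, y1.to("cuda:0")], dim=1)

x = torch.randn(4, 8192, 512)                # a long 8192-token sequence
print(sequence_parallel_forward(x).shape)    # torch.Size([4, 8192, 512])
```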





How to choose which paradigm to use?

For small models that fit on a single GPU, data parallelism is usually the most efficient. For larger models, model or tensor parallelism is necessary, with tensor parallelism offering better efficiency due to reduced sequential dependencies. Combining data and tensor parallelism is often the best strategy for modern multi-GPU setups.
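As one way to picture the combination, recent PyTorch releases (2.2+) expose a DeviceMesh abstraction that arranges the GPUs of a node into a 2D grid, with one axis for data parallelism and one for tensor parallelism. The sketch below only builds the mesh and its communication groups; the (2, 4) shape is a hypothetical 8-GPU example, and the script would be launched with torchrun so that each process owns one GPU.

```python
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical 8-GPU node: 2-way data parallelism x 4-way tensor parallelism.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

# Gradient averaging (data parallelism) communicates over the "dp" sub-group;
# sharded matrix multiplies (tensor parallelism) communicate over "tp".
dp_group = mesh["dp"].get_group()
tp_group = mesh["tp"].get_group()
```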

 
 
 
