Distributed Training (PyTorch DDP)

Run large-scale distributed training with PyTorch DDP or DeepSpeed on a Voltage Park bare-metal H100 NVLink cluster.

Recommended Hardware

Instance Type Overview

Spheron GPU offerings are classified by two criteria: interruptibility and hardware isolation.

All Spot instances are VM-based and can be reclaimed by the provider at any time. Use Spot only for fault-tolerant jobs with checkpointing. Dedicated instances carry a 99.95% SLA and are not reclaimed after deployment.

Within Dedicated, three hardware isolation options are available:

  • VM: Runs in an isolated virtual machine on shared physical hardware. The default across most providers and GPU offers.
  • Bare Metal: Full physical server with no hypervisor, no shared tenants. GPU count varies by offer and provider, from single-GPU up to multi-GPU servers. On the dashboard, identified by the BAREMETAL suffix in the GPU type name.
  • Cluster: Entire 8-GPU bare-metal server with a dedicated high-speed interconnect. Select InfiniBand (3.2 Tbps) for all-reduce-heavy workloads where GPU-to-GPU bandwidth is the bottleneck, or Ethernet (100 Gbps) for lower-cost distributed jobs. Currently available on Voltage Park only; additional providers are in progress. Identified by the CLUSTER designation in the offers list.

For multi-GPU distributed training, use a Dedicated Cluster instance. It provides bare-metal access to all 8 GPUs on a single host through a direct hardware interconnect, enabling efficient gradient synchronization across processes.

  • Provider: Voltage Park (currently the only provider offering Cluster instances; additional providers are in progress)
  • Offer: look for offers with the CLUSTER designation in the offers list

Interconnect Options

The interconnect type is determined by the offer you select. Two options are available on Voltage Park:

| Interconnect | Bandwidth | Notes |
| --- | --- | --- |
| InfiniBand | 3.2 Tbps | H100 SXM5 with NVLink; optimal for all-reduce-heavy DDP and ZeRO-3 runs |
| Ethernet | 100 Gbps | Lower cost; sufficient for distributed jobs where network bandwidth is not the bottleneck |

Choose InfiniBand for large model training where gradient synchronization is a bottleneck. Ethernet is adequate for smaller distributed jobs or when cost is the priority.
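A back-of-envelope comparison makes the trade-off concrete: a ring all-reduce moves roughly 2·(N−1)/N bytes per gradient byte, so per-step sync time scales with gradient size divided by interconnect bandwidth. The sketch below is idealized (it ignores latency, protocol overhead, and compute/communication overlap), and the 7B-parameter BF16 gradient size is an illustrative assumption, not a figure from this guide:

```python
def allreduce_seconds(grad_bytes, bandwidth_bits_per_s, num_gpus=8):
    """Idealized ring all-reduce time: each rank sends and receives
    about 2*(N-1)/N of the gradient buffer over the interconnect."""
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic_bytes / (bandwidth_bits_per_s / 8)

# Assumption: 7B parameters in BF16 -> ~14e9 bytes of gradients per step.
grads = 14e9
infiniband = allreduce_seconds(grads, 3.2e12)  # ~0.06 s per step
ethernet = allreduce_seconds(grads, 100e9)     # ~2 s per step
```

Even under these idealized assumptions the gap is roughly 30x, which is why gradient-synchronization-bound jobs favor the InfiniBand offer.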

Deploy the Instance

Deploy a Voltage Park Cluster instance from the dashboard. On the Deploy GPUs page, filter by Cluster and select an H100 NVLink or H100 Ethernet offer (8 GPUs). Choose Ubuntu 22.04 as the operating system and attach your SSH key.

Running Distributed Training with torchrun

Once SSH'd into the instance, launch your training script with torchrun:

torchrun \
  --nproc_per_node=8 \
  --nnodes=1 \
  train.py \
  --batch_size 32 \
  --gradient_checkpointing

--nproc_per_node=8 uses all 8 H100 GPUs. For a 4-GPU offer, use --nproc_per_node=4.
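torchrun sets per-process environment variables (RANK, LOCAL_RANK, WORLD_SIZE) that your script reads to pick its GPU. A minimal sketch of reading them, with single-process defaults so the script also runs outside torchrun (the helper name is our own, not a torch API):

```python
import os

def get_dist_env():
    """Read the environment variables torchrun sets for each worker.

    RANK is the global rank, LOCAL_RANK the GPU index on this node,
    WORLD_SIZE the total process count. The defaults cover a plain
    single-process run launched without torchrun.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```

With `--nproc_per_node=8`, WORLD_SIZE is 8 and each worker sees a LOCAL_RANK from 0 to 7.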

PyTorch DDP Training Script

Minimal example of a DDP-compatible training loop:

import argparse
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
 
def setup():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
 
def cleanup():
    dist.destroy_process_group()
 
def train():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--gradient_checkpointing", action="store_true")
    args = parser.parse_args()
 
    setup()
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
 
    model = YourModel().to(local_rank)  # replace with your model
    model = DDP(model, device_ids=[local_rank])
 
    # Enable gradient checkpointing to reduce VRAM usage
    if args.gradient_checkpointing:
        model.module.gradient_checkpointing_enable()
 
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
 
    # DistributedSampler ensures each worker sees a disjoint shard of the data
    dataset = YourDataset()  # replace with your dataset
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=args.batch_size, sampler=sampler)
 
    num_epochs = 3
    for epoch in range(num_epochs):
        # Reshuffle the dataset differently for each epoch across all workers
        sampler.set_epoch(epoch)
 
        for step, batch in enumerate(dataloader):
            # Move the batch to this rank's GPU before the forward pass
            batch = {k: v.to(local_rank) for k, v in batch.items()}

            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = model(**batch).loss

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
 
            # Save checkpoint every 100 steps
            if step % 100 == 0 and rank == 0:
                torch.save({
                    'step': step,
                    'epoch': epoch,
                    'model_state_dict': model.module.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                }, f'/checkpoints/checkpoint_epoch{epoch}_step{step}.pt')
 
    cleanup()
 
if __name__ == '__main__':
    train()
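To resume after an interruption, restart from the newest file in /checkpoints. A small helper that parses the epoch and step out of the filenames used by the loop above (the directory and naming scheme come from that loop; the helper itself is a sketch):

```python
import os
import re

# Matches the names written by the training loop above,
# e.g. checkpoint_epoch1_step200.pt
_CKPT_RE = re.compile(r"checkpoint_epoch(\d+)_step(\d+)\.pt$")

def latest_checkpoint(ckpt_dir):
    """Return the checkpoint path with the highest (epoch, step),
    or None if the directory contains no matching files."""
    best, best_key = None, (-1, -1)
    for name in os.listdir(ckpt_dir):
        m = _CKPT_RE.match(name)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            if key > best_key:
                best_key, best = key, os.path.join(ckpt_dir, name)
    return best
```

Load the result with `torch.load(path, map_location=f"cuda:{local_rank}")` and restore the model and optimizer state dicts saved above.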

DeepSpeed ZeRO-3 for Models >30B

For models too large to fit in a single GPU's memory, use DeepSpeed ZeRO-3 to shard parameters, gradients, and optimizer states across all GPUs.

ds_config.json:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e6
  },
  "bf16": { "enabled": true },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8
}
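The effective global batch size implied by this config is the per-GPU micro-batch times the gradient-accumulation steps times the number of GPUs. A quick sanity check (the helper is our own, not a DeepSpeed API):

```python
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer step across all ranks."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus

# With the config above on an 8-GPU Cluster instance:
# 1 micro-batch * 8 accumulation steps * 8 GPUs = 64 samples per step.
print(global_batch_size(1, 8, 8))  # → 64
```

Raise `train_micro_batch_size_per_gpu` first if VRAM allows, since accumulation steps add sequential passes without improving GPU utilization.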

Launch with DeepSpeed:

deepspeed --num_gpus=8 train.py \
  --deepspeed ds_config.json \
  --model_name_or_path meta-llama/Llama-3-70b

Mixed Precision (BF16)

H100s have native BF16 support. Prefer BF16 over FP16 for training on H100s: it runs at the same tensor-core throughput, and its wider exponent range makes it far less prone to overflow, so no loss scaling is needed:

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(**batch)

Checkpoint Persistence

Mount a Voltage Park NFS volume at /checkpoints before your training run to protect checkpoints across deployments:

  1. Create a volume: see Voltage Park Volume Mounting
  2. Mount it at /checkpoints in your cloud-init script
  3. Save checkpoints to /checkpoints/ in your training loop (example above)
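Step 2 boils down to an NFS mount; a hedged sketch of the commands a cloud-init `runcmd` would run as root. The server address and export path below are placeholders for the values shown on your Voltage Park volume page:

```shell
# Install the NFS client and mount the volume at /checkpoints.
# <nfs-server> and /export/checkpoints are placeholders; substitute
# the address and path from your Voltage Park volume details.
apt-get install -y nfs-common
mkdir -p /checkpoints
mount -t nfs <nfs-server>:/export/checkpoints /checkpoints
```

Add the same mount to /etc/fstab (or repeat it in cloud-init) so checkpoints survive instance redeployment.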

Dataset Storage

For large datasets, use External NVMe storage on Voltage Park; it provides much higher local I/O bandwidth than NFS volumes.

See External Storage Access for setup instructions.

GPU Monitoring

Watch per-GPU utilization during training:

nvidia-smi dmon -s u

Check NVLink health and bandwidth:

nvidia-smi nvlink --status
nvidia-smi nvlink --capabilities

Monitor GPU memory:

nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1

Additional Resources