Distributed Training (PyTorch DDP)
Run large-scale distributed training with PyTorch DDP or DeepSpeed on a Voltage Park bare-metal H100 NVLink cluster.
Recommended Hardware
Instance Type Overview
Spheron GPU offerings are classified by two criteria: interruptibility and hardware isolation.
All Spot instances are VM-based and can be reclaimed by the provider at any time. Use Spot only for fault-tolerant jobs with checkpointing. Dedicated instances carry a 99.95% SLA and are not reclaimed after deployment.
Within Dedicated, three hardware isolation options are available:
- VM: Runs in an isolated virtual machine on shared physical hardware. The default across most providers and GPU offers.
- Bare Metal: Full physical server with no hypervisor and no shared tenants. GPU count varies by offer and provider, from single-GPU up to multi-GPU servers. On the dashboard, identified by the BAREMETAL suffix in the GPU type name.
- Cluster: Entire 8-GPU bare-metal server with a dedicated high-speed interconnect. Select InfiniBand (3.2 Tbps) for all-reduce-heavy workloads where GPU-to-GPU bandwidth is the bottleneck, or Ethernet (100 Gbps) for lower-cost distributed jobs. Currently available on Voltage Park only; additional providers are in progress. Identified by the CLUSTER designation in the offers list.
For multi-GPU distributed training, use a Dedicated Cluster instance. It provides bare-metal access to all 8 GPUs on a single host through a direct hardware interconnect, enabling efficient gradient synchronization across processes.
Provider: Voltage Park (the only provider currently offering Cluster instances; additional providers are in progress)
Offer: Look for offers with the CLUSTER designation in the offers list
Interconnect Options
The interconnect type is determined by the offer you select. Two options are available on Voltage Park:
| Interconnect | Bandwidth | Notes |
|---|---|---|
| InfiniBand | 3.2 Tbps | H100 SXM5 with NVLink; optimal for all-reduce-heavy DDP and ZeRO-3 runs |
| Ethernet | 100 Gbps | Lower cost; sufficient for distributed jobs where network bandwidth is not the bottleneck |
Choose InfiniBand for large model training where gradient synchronization is a bottleneck. Ethernet is adequate for smaller distributed jobs or when cost is the priority.
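To gauge which interconnect a job needs, note that ring all-reduce makes each GPU send roughly 2·(N−1)/N bytes for every gradient byte per optimizer step. A back-of-the-envelope sketch (the 7B-parameter model is an illustrative assumption, not a benchmark):

```python
def allreduce_gb_per_step(param_count: int, bytes_per_grad: int = 2, world_size: int = 8) -> float:
    """Approximate data each GPU sends per optimizer step with ring all-reduce."""
    grad_bytes = param_count * bytes_per_grad          # BF16 gradients: 2 bytes each
    return 2 * (world_size - 1) / world_size * grad_bytes / 1e9

# A hypothetical 7B-parameter model with BF16 gradients across 8 GPUs:
print(f"{allreduce_gb_per_step(7_000_000_000):.1f} GB per step")  # -> 24.5 GB per step
```

At tens of gigabytes per step, a 100 Gbps (~12.5 GB/s) Ethernet link can dominate step time, which is exactly when the InfiniBand option pays off.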
Deploy the Instance
Deploy a Voltage Park Cluster instance from the dashboard. On the Deploy GPUs page, filter by Cluster and select an H100 NVLink or H100 Ethernet offer (8 GPUs). Choose Ubuntu 22.04 as the operating system and attach your SSH key.
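After deployment, you can confirm which transport NCCL actually selects at startup. These are standard NCCL environment variables rather than anything Voltage Park-specific; a minimal sketch:

```shell
# Print NCCL's transport selection during init
# (look for "NET/IB" vs "NET/Socket" in the worker logs)
export NCCL_DEBUG=INFO
# Ensure the InfiniBand transport is not disabled (0 is the default)
export NCCL_IB_DISABLE=0
```

Set these in the shell (or job script) before launching torchrun so every worker inherits them.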
Running Distributed Training with torchrun
Once SSH'd into the instance, launch your training script with torchrun:
```bash
torchrun \
  --nproc_per_node=8 \
  --nnodes=1 \
  train.py \
  --batch_size 32 \
  --gradient_checkpointing
```

--nproc_per_node=8 uses all 8 H100 GPUs. For a 4-GPU offer, use --nproc_per_node=4.
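torchrun passes rank and rendezvous information to each worker through environment variables rather than command-line flags, so the training script reads them from the environment. A small sketch (the values below simulate a single-node launch with --nproc_per_node=8):

```python
import os

# Simulated values; torchrun sets these for real when it spawns workers
os.environ.update({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})

rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this host
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

print(rank, local_rank, world_size)  # -> 3 3 8
```

On a single node, RANK and LOCAL_RANK coincide; on multi-node runs they diverge, which is why the script below reads LOCAL_RANK for device placement.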
PyTorch DDP Training Script
Minimal example of a DDP-compatible training loop:
```python
import argparse
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def setup():
    # torchrun provides MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE;
    # NCCL is the recommended backend for GPU training
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


def cleanup():
    dist.destroy_process_group()


def train():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--gradient_checkpointing", action="store_true")
    args = parser.parse_args()

    setup()
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])

    model = YourModel().to(local_rank)  # replace with your model
    model = DDP(model, device_ids=[local_rank])

    # Enable gradient checkpointing to reduce VRAM usage
    if args.gradient_checkpointing:
        model.module.gradient_checkpointing_enable()

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # DistributedSampler ensures each worker sees a disjoint shard of the data
    dataset = YourDataset()  # replace with your dataset
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=args.batch_size, sampler=sampler)

    num_epochs = 3
    for epoch in range(num_epochs):
        # Reshuffle the dataset differently for each epoch across all workers
        sampler.set_epoch(epoch)
        for step, batch in enumerate(dataloader):
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            # Save a checkpoint every 100 steps, from rank 0 only
            if step % 100 == 0 and rank == 0:
                torch.save({
                    'step': step,
                    'epoch': epoch,
                    'model_state_dict': model.module.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                }, f'/checkpoints/checkpoint_epoch{epoch}_step{step}.pt')

    cleanup()


if __name__ == '__main__':
    train()
```

DeepSpeed ZeRO-3 for Models >30B
For models too large to fit in a single GPU's memory, use DeepSpeed ZeRO-3 to shard parameters, gradients, and optimizer states across all GPUs.
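Under ZeRO-3 (as with plain DDP), the effective global batch size is the per-GPU micro-batch times gradient-accumulation steps times the number of GPUs. With the values used in the config below (micro-batch 1, accumulation 8, 8 GPUs):

```python
def global_batch_size(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Effective number of samples consumed per optimizer step."""
    return micro_batch * grad_accum * num_gpus

print(global_batch_size(micro_batch=1, grad_accum=8, num_gpus=8))  # -> 64
```

Keep this product constant when you change GPU count, or the learning rate schedule will no longer match the run you are reproducing.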
ds_config.json:
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e6
  },
  "bf16": { "enabled": true },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8
}
```

Launch with DeepSpeed:
```bash
deepspeed --num_gpus=8 train.py \
  --deepspeed ds_config.json \
  --model_name_or_path meta-llama/Llama-3-70b
```

Mixed Precision (BF16)
H100s support BF16 natively. Prefer BF16 over FP16 for training on H100s: it delivers the same tensor-core throughput while keeping FP32's dynamic range, so training is more numerically stable and needs no loss scaling.
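The stability difference comes from exponent width: BF16 reuses FP32's 8-bit exponent (finite values up to ~3.4e38), while FP16's 5-bit exponent tops out at 65504, so large activations or gradient norms overflow. A pure-Python sketch of the BF16 layout (shown as simple truncation; real hardware rounds to nearest):

```python
import struct

FP16_MAX = 65504.0  # largest finite float16 value (5-bit exponent)

def to_bf16(x: float) -> float:
    """Keep only the top 16 bits of the float32 encoding (the bfloat16 layout)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

grad_norm = 1e5              # would overflow FP16 (> 65504) ...
print(grad_norm > FP16_MAX)  # -> True
print(to_bf16(grad_norm))    # -> 99840.0 (finite in BF16, ~3 decimal digits of precision)
```

In practice you never convert by hand; the autocast context handles it for you: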
```python
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(**batch)
```

Checkpoint Persistence
Mount a Voltage Park NFS volume at /checkpoints before your training run to protect checkpoints across deployments:
- Create a volume: see Voltage Park Volume Mounting
- Mount it at /checkpoints in your cloud-init script
- Save checkpoints to /checkpoints/ in your training loop (example above)
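To resume after a redeploy, pick the newest file matching the naming scheme used in the training loop above. A hypothetical helper (the pattern matches the torch.save filenames shown earlier; adjust it if your scheme differs):

```python
import re
from pathlib import Path

def latest_checkpoint(ckpt_dir: str):
    """Return the checkpoint path with the highest (epoch, step), or None."""
    pattern = re.compile(r"checkpoint_epoch(\d+)_step(\d+)\.pt")
    best = None
    for path in Path(ckpt_dir).glob("checkpoint_epoch*_step*.pt"):
        m = pattern.fullmatch(path.name)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            if best is None or key > best[0]:
                best = (key, path)
    return best[1] if best else None
```

Load the returned path with torch.load and restore the model and optimizer state dicts before resuming the loop.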
Dataset Storage
For large datasets, use External NVMe storage on Voltage Park; it provides much higher local I/O bandwidth than NFS volumes.
See External Storage Access for setup instructions.
GPU Monitoring
Watch per-GPU utilization during training:
```bash
nvidia-smi dmon -s u
```

Check NVLink health and bandwidth:

```bash
nvidia-smi nvlink --status
nvidia-smi nvlink --capabilities
```

Monitor GPU memory:

```bash
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1
```

Additional Resources
- Voltage Park Volume Mounting: Persistent checkpoint storage
- External Storage Access: NVMe for large datasets
- Instance Types: Spot vs Dedicated, and hardware isolation categories (VM, Bare Metal, Cluster)
- Cost Optimization: Reserved GPU pricing for long-term training