# Training Guides
Guides for running model training workloads on Spheron GPU instances, from single-GPU fine-tuning to large-scale distributed training on bare-metal H100 clusters.
## Choosing the Right Instance for Training
| Workload | Recommended Type | Why |
|---|---|---|
| Experiments, prototyping | Spot | 30–60% cheaper; interrupt-safe with checkpointing |
| Single-GPU fine-tuning | Dedicated (RTX 4090 / A100) | No interruption risk for multi-hour runs |
| Multi-GPU distributed training | Cluster (Voltage Park H100) | NVLink interconnect, full physical server access |
| Production training runs (days) | Dedicated or Cluster | Guaranteed availability |
Use Spot instances for experiments: they cut costs by 30–60%, but can be reclaimed at any time. Enable checkpoint saving to a persistent volume so a run can resume from its last checkpoint rather than starting over if the instance is reclaimed.
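A minimal sketch of the checkpoint-and-resume pattern described above. The `/mnt/checkpoints` mount point is a hypothetical persistent-volume path, and plain JSON stands in for real training state; a PyTorch run would write `torch.save(...)` state dicts the same way, but the atomic-rename and resume logic is identical:

```python
import json
import os

CKPT_DIR = "/mnt/checkpoints"  # hypothetical persistent-volume mount point

def save_checkpoint(state: dict, step: int, ckpt_dir: str = CKPT_DIR) -> str:
    """Atomically write training state so a reclaimed Spot instance never
    leaves a half-written checkpoint behind."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems
    return path

def latest_checkpoint(ckpt_dir: str = CKPT_DIR):
    """Return (step, state) for the newest checkpoint, or (0, None) if none exist,
    so a restarted instance can resume where it left off."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    steps = [int(name.split("_")[1].split(".")[0])
             for name in os.listdir(ckpt_dir)
             if name.startswith("step_") and name.endswith(".json")]
    if not steps:
        return 0, None
    step = max(steps)
    with open(os.path.join(ckpt_dir, f"step_{step}.json")) as f:
        return step, json.load(f)
```

In a training loop, call `save_checkpoint` every N steps and `latest_checkpoint` once at startup; if the instance is reclaimed mid-run, the restarted job loses at most N steps of work.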
## Available Guides
### Distributed Training (PyTorch DDP)
Multi-GPU PyTorch DDP and DeepSpeed ZeRO-3 training on a Voltage Park bare-metal H100 cluster (up to 8× H100 NVLink). Covers torchrun invocation, gradient checkpointing, BF16 mixed precision, checkpoint persistence, and GPU monitoring.
**Best for:** Large language model pre-training and fine-tuning; multi-day training runs on 8× H100 NVLink.
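For reference, the torchrun invocation the guide covers follows the standard single-node and multi-node shapes below. `train.py`, the port, and `$HEAD_IP` are placeholders for your own script and cluster addresses; flag names are standard `torchrun` options:

```shell
# Single node, 8x H100: one worker process per GPU.
torchrun --standalone --nproc_per_node=8 train.py

# Multi-node: run on every node, with --node_rank=0 on the head node
# and a reachable head-node address as the rendezvous endpoint.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$HEAD_IP:29500" \
  train.py
```

torchrun sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in each worker's environment, which the training script reads when initializing the process group.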
## Additional Resources
- Instance Types: Spot vs Dedicated vs Cluster
- Volume Mounting: Persistent checkpoint storage
- Cost Optimization: Reducing training costs with Spot and Reserved GPUs