# Training Guides
Guides for running model training workloads on Spheron GPU instances, from single-GPU fine-tuning to large-scale distributed training on bare-metal H100 clusters.
## Choosing the Right Instance for Training
| Workload | Recommended Type | Why |
|---|---|---|
| Experiments, prototyping | Spot | 30–60% cheaper; interrupt-safe with checkpointing |
| Single-GPU fine-tuning | Dedicated (RTX 4090 / A100) | No interruption risk for multi-hour runs |
| Multi-GPU distributed training | Cluster (Voltage Park H100) | NVLink interconnect, full physical server access |
| Production training runs (days) | Dedicated or Cluster | Guaranteed availability |
Use Spot instances for experiments: they cut costs by 30–60%, but can be reclaimed at any time. Enable checkpoint saving to a persistent volume so a run can resume from its last checkpoint rather than starting over if the instance is reclaimed.
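A minimal sketch of the checkpoint-and-resume pattern described above. The `/mnt/checkpoints` mount point is a hypothetical persistent-volume path, and plain JSON stands in for real training state; a PyTorch run would write `torch.save(...)` state dicts the same way, but the atomic-rename and resume logic is identical:

```python
import json
import os

CKPT_DIR = "/mnt/checkpoints"  # hypothetical persistent-volume mount point

def save_checkpoint(state: dict, step: int, ckpt_dir: str = CKPT_DIR) -> str:
    """Atomically write training state so a reclaimed Spot instance never
    leaves a half-written checkpoint behind."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems
    return path

def latest_checkpoint(ckpt_dir: str = CKPT_DIR):
    """Return (step, state) for the newest checkpoint, or (0, None) if none exist,
    so a restarted instance can resume where it left off."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    steps = [int(name.split("_")[1].split(".")[0])
             for name in os.listdir(ckpt_dir)
             if name.startswith("step_") and name.endswith(".json")]
    if not steps:
        return 0, None
    step = max(steps)
    with open(os.path.join(ckpt_dir, f"step_{step}.json")) as f:
        return step, json.load(f)
```

In a training loop, call `save_checkpoint` every N steps and `latest_checkpoint` once at startup; if the instance is reclaimed mid-run, the restarted job loses at most N steps of work.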
## Available Guides
### Distributed Training (PyTorch DDP)
Multi-GPU PyTorch DDP and DeepSpeed ZeRO-3 training on a Voltage Park bare-metal H100 cluster (up to 8× H100 NVLink). Covers torchrun invocation, gradient checkpointing, BF16 mixed precision, checkpoint persistence, and GPU monitoring.
**Best for:** Large language model pre-training and fine-tuning; multi-day training runs on 8× H100 NVLink.
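For reference, the torchrun invocation the guide covers follows the standard single-node and multi-node shapes below. `train.py`, the port, and `$HEAD_IP` are placeholders for your own script and cluster addresses; flag names are standard `torchrun` options:

```shell
# Single node, 8x H100: one worker process per GPU.
torchrun --standalone --nproc_per_node=8 train.py

# Multi-node: run on every node, with --node_rank=0 on the head node
# and a reachable head-node address as the rendezvous endpoint.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$HEAD_IP:29500" \
  train.py
```

torchrun sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in each worker's environment, which the training script reads when initializing the process group.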
## Additional Resources
- Instance Types: Spot vs Dedicated vs Cluster
- Volume Mounting: Persistent checkpoint storage
- Cost Optimization: Reducing training costs with Spot and Reserved GPUs