Training Guides

Guides for running model training workloads on Spheron GPU instances, from single-GPU fine-tuning to large-scale distributed training on bare-metal H100 clusters.

Choosing the Right Instance for Training

| Workload | Recommended Type | Why |
| --- | --- | --- |
| Experiments, prototyping | Spot | 30–60% cheaper; interrupt-safe with checkpointing |
| Single-GPU fine-tuning | Dedicated (RTX 4090 / A100) | No interruption risk for multi-hour runs |
| Multi-GPU distributed training | Cluster (Voltage Park H100) | NVLink interconnect, full physical server access |
| Production training runs (days) | Dedicated or Cluster | Guaranteed availability |

Use Spot instances for experiments to cut costs by 30–60%. Enable checkpoint saving to a persistent volume so training progress survives if the instance is reclaimed.
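The reclaim-safe checkpointing pattern can be sketched in plain Python. This is a minimal stdlib-only illustration of the idea (real training runs would serialize model and optimizer state with `torch.save` instead of JSON); the directory layout and `ckpt_` filename prefix are assumptions for the example, not a Spheron convention. The key detail is writing to a temporary file and renaming, so a reclaim mid-write never leaves a truncated checkpoint on the persistent volume.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, ckpt_dir: str, step: int) -> str:
    """Atomically write a checkpoint to the (persistent) ckpt_dir.

    Writing to a temp file and then os.replace()-ing it means a Spot
    reclaim mid-write cannot leave a half-written checkpoint behind.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"ckpt_{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
    return final_path


def load_latest(ckpt_dir: str):
    """Resume from the newest complete checkpoint, or None on a fresh start."""
    if not os.path.isdir(ckpt_dir):
        return None
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.startswith("ckpt_"))
    if not ckpts:
        return None
    with open(os.path.join(ckpt_dir, ckpts[-1])) as f:
        return json.load(f)
```

On restart after a reclaim, the training script calls `load_latest` first and resumes from the returned step rather than from scratch.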

Available Guides

Distributed Training (PyTorch DDP)

Multi-GPU PyTorch DDP and DeepSpeed ZeRO-3 training on a Voltage Park bare-metal H100 cluster (up to 8× H100 NVLink). Covers torchrun invocation, gradient checkpointing, BF16 mixed precision, checkpoint persistence, and GPU monitoring.

Best for: Large language model pre-training and fine-tuning; multi-day training runs on 8× H100 NVLink.
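As a rough sketch of the torchrun invocation the guide covers, a single-node launch on an 8× H100 machine might look like the following. `train.py` and its flags are placeholders standing in for your own DDP training script and its arguments; only `torchrun --standalone --nproc_per_node` is the real launcher interface, and the checkpoint path assumes a persistent volume mounted at a hypothetical `/mnt/persistent`.

```shell
# Launch 8 worker processes, one per GPU, on this single node.
# --standalone lets torchrun handle rendezvous locally without a
# separate coordinator; train.py and its flags are illustrative.
torchrun \
  --standalone \
  --nproc_per_node=8 \
  train.py \
  --bf16 \
  --gradient-checkpointing \
  --checkpoint-dir /mnt/persistent/ckpts
```

For multi-node runs, `--standalone` is replaced by `--nnodes`, `--node_rank`, and a shared `--rdzv_endpoint`, as detailed in the guide itself.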

Additional Resources