# PyTorch Environment

## What's Included

Core Frameworks:

- PyTorch 2.x with CUDA support
- torchvision and torchaudio
- Hugging Face Transformers and Accelerate
- Jupyter Notebook for interactive development
- TensorBoard for training visualization
- NumPy, Pandas, Matplotlib
- datasets and scikit-learn
- Ubuntu 22.04 or 24.04 LTS
- NVIDIA drivers (550 or 570) pre-installed
- CUDA toolkit (12.x)
- Python 3.10–3.11
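Once connected, a quick way to confirm the bundled Python packages is to query their installed versions. A minimal sketch (the package names come from the list above; the helper function is our own, not part of any tooling):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for pkg in packages:
        try:
            report[pkg] = version(pkg)
        except PackageNotFoundError:
            report[pkg] = None
    return report

print(installed_versions(["torch", "torchvision", "transformers", "accelerate", "datasets"]))
```

Any `None` in the output means that package is missing from the image you deployed.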
## CUDA and NVIDIA Drivers
PyTorch requires a compatible CUDA version and NVIDIA driver. Spheron GPU images come with NVIDIA drivers and CUDA pre-installed.
Verify your driver and CUDA versions after connecting:

```bash
# Check NVIDIA driver version
nvidia-smi

# Check CUDA compiler version
nvcc --version

# Check which CUDA versions are installed
ls /usr/local/ | grep cuda
```

PyTorch/CUDA compatibility at a glance:

| PyTorch Version | CUDA 11.8 | CUDA 12.1 | CUDA 12.4 |
|---|---|---|---|
| 2.0.x | ✓ | ✓ | N/A |
| 2.1.x | ✓ | ✓ | N/A |
| 2.2.x | ✓ | ✓ | ✓ |
| 2.3.x+ | ✓ | ✓ | ✓ |
Always match the `--index-url` in your pip install command to your CUDA version (see the PyTorch install page).
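For scripted installs, the mapping from CUDA version to wheel index is mechanical; a small sketch (the URL pattern follows the official install commands, but the helper name is illustrative, not part of any official tooling):

```python
# Build the pip --index-url for a given CUDA version, e.g. "12.1" -> .../whl/cu121.
# Helper name is illustrative, not part of any official tooling.
def wheel_index_url(cuda_version: str) -> str:
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://download.pytorch.org/whl/{tag}"

print(wheel_index_url("12.1"))
# Then: pip install torch torchvision torchaudio --index-url <that URL>
```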
## Deploying a PyTorch Environment

### Using a Pre-configured OS Image

- Go to app.spheron.ai → Deploy
- Choose your GPU
- Select OS: Ubuntu 24.04 LTS ML PyTorch or Ubuntu 24.04 LTS ML Everything
- Deploy; the instance is ready in 30–60 seconds

### Using a Startup Script
Use the PyTorch + CUDA 12.1 startup template to install PyTorch on a base Ubuntu image:
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3.11 python3.11-venv
  - python3.11 -m ensurepip --upgrade
  - python3.11 -m pip install --upgrade pip
  - python3.11 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  - python3.11 -m pip install transformers accelerate bitsandbytes datasets
```

## Verify Installation
After connecting via SSH:
```bash
# Check PyTorch version and CUDA availability
python3 -c "import torch; print('PyTorch:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"

# Check GPU count and names
python3 -c "import torch; print('GPU count:', torch.cuda.device_count()); [print(f' GPU {i}:', torch.cuda.get_device_name(i)) for i in range(torch.cuda.device_count())]"

# Run nvidia-smi to see GPU utilization
nvidia-smi
```

Expected output on an H100 instance:

```text
PyTorch: 2.3.0+cu121
CUDA available: True
CUDA version: 12.1
GPU count: 8
GPU 0: NVIDIA H100 80GB HBM3
...
```

## Quick Start
### Basic GPU Computation

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move tensor to GPU
x = torch.randn(1000, 1000).to(device)
y = torch.matmul(x, x.T)
print("Matrix multiply done, result shape:", y.shape)
```

### Load a Hugging Face Model on GPU
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # replace with your model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # spreads across all available GPUs
)

# With device_map="auto", send inputs to the device of the first model shard
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Common Packages
```bash
# Large language models
pip install transformers accelerate bitsandbytes peft trl

# Computer vision
pip install torchvision timm opencv-python pillow

# Distributed training
pip install deepspeed

# Experiment tracking
pip install wandb tensorboard

# Data
pip install datasets huggingface-hub
```

## Troubleshooting
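When something is off, it helps to print all the relevant facts in one go; a torch-only sketch that is safe to run on any machine, GPU or not:

```python
import torch

# One-shot diagnostic: collects the facts needed for most CUDA troubleshooting.
print("PyTorch:", torch.__version__)
print("Compiled CUDA:", torch.version.cuda)  # None on CPU-only wheels
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```

If "Compiled CUDA" prints `None`, you installed a CPU-only wheel; reinstall with the correct `--index-url`.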
**CUDA not available:**

```bash
# Confirm NVIDIA driver is loaded
nvidia-smi

# Reinstall PyTorch matching your CUDA version
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

**Out of memory:**

- Reduce batch size
- Enable gradient checkpointing: `model.gradient_checkpointing_enable()`
- Use BF16: `model = model.to(torch.bfloat16)`
- Monitor memory: `nvidia-smi -l 1` or `torch.cuda.memory_summary()`
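The gradient-checkpointing idea above is also available in plain PyTorch via `torch.utils.checkpoint`; a minimal CPU-safe sketch (the toy network is ours, purely for illustration):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy network: activations inside `checkpoint` are recomputed during backward
# instead of being stored, trading extra compute for lower peak memory.
net = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
)

x = torch.randn(8, 64, requires_grad=True)
y = checkpoint(net, x, use_reentrant=False)  # forward pass with checkpointing
y.sum().backward()
print("grad shape:", tuple(x.grad.shape))
```

Hugging Face's `gradient_checkpointing_enable()` applies the same mechanism to each transformer layer.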
**Multi-GPU monitoring:**

```bash
# Per-GPU utilization
nvidia-smi dmon -s u

# NVLink status (on NVLink clusters)
nvidia-smi nvlink --status
```

## Additional Resources
- Distributed Training guide: PyTorch DDP on H100 bare-metal clusters
- Templates & Images: PyTorch startup script
- Ubuntu Environments: OS images with CUDA and NVIDIA drivers
- TensorFlow: TensorFlow GPU environment