
CUDA & NVIDIA Drivers

What Are NVIDIA Drivers?

NVIDIA drivers are software components that let the operating system communicate with the GPU hardware. Without a compatible driver, the GPU cannot be used for any compute workload.

On Spheron instances, NVIDIA drivers come pre-installed on all GPU images. You don't need to install them manually.

Key points:
  • Drivers are specific to the GPU architecture (e.g., Hopper for H100, Ampere for A100)
  • Each driver version exposes a maximum supported CUDA version
  • Driver version ≠ CUDA version; they are separate but must be compatible

Check the installed driver after connecting:

nvidia-smi

Example output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08   CUDA Version: 12.4      |
+-----------------------------------------------------------------------------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================================================================|
|   0  NVIDIA H100 80GB HBM3           Off |   00000000:00:00.0 Off |                    0 |
| N/A   34C    P0             72W / 700W |      0MiB / 81920MiB |      0%      Default |
+-----------------------------------------------------------------------------------------+

The Driver Version and CUDA Version fields tell you exactly what is installed.
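
If you want to read those fields from a script, the banner line can be parsed with a regular expression. A minimal sketch, using a sample string that mirrors the example output above:

```python
import re

# Sample banner line as printed by nvidia-smi (mirrors the example output above)
banner = "| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08   CUDA Version: 12.4      |"

# Extract the driver version and the driver's maximum supported CUDA version
driver = re.search(r"Driver Version:\s*([\d.]+)", banner).group(1)
max_cuda = re.search(r"CUDA Version:\s*([\d.]+)", banner).group(1)

print(driver)    # 550.127.08
print(max_cuda)  # 12.4
```

In practice, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints the driver version directly, with no parsing needed.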

What Is CUDA?

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform that lets software run computations on the GPU. Nearly all AI/ML frameworks depend on it.

CUDA has two components:
Component    | What it is                         | How to check
CUDA Runtime | Libraries used by your application | nvidia-smi → shows max supported CUDA
CUDA Toolkit | Compiler (nvcc) + dev tools        | nvcc --version

# Compiler version (CUDA toolkit)
nvcc --version
 
# List all CUDA installations
ls /usr/local/ | grep cuda

How CUDA and Drivers Affect Development

The relationship between drivers, CUDA, and your frameworks determines what works:

GPU Hardware
    └── NVIDIA Driver  (minimum requirement)
            └── CUDA Runtime  (must be ≤ driver's max CUDA)
                    └── Framework (PyTorch, TensorFlow, JAX…)
                              └── Your Code
What this means in practice:
  • A newer driver supports a higher maximum CUDA version, but is backward-compatible with older CUDA runtimes
  • If your framework requires CUDA 12.4 but the instance only has CUDA 12.0, builds or training runs will fail
  • Mismatched versions are the most common source of "CUDA not available" errors
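
The "runtime must be ≤ driver's max CUDA" rule above is just a major/minor version comparison. A small illustrative helper (the function name and sample versions are assumptions, not a real API):

```python
def cuda_compatible(runtime_version: str, driver_max: str) -> bool:
    """True if the CUDA runtime version is at or below the driver's max supported CUDA."""
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(runtime_version) <= to_tuple(driver_max)

print(cuda_compatible("12.1", "12.4"))  # True: older runtime on a newer driver is fine
print(cuda_compatible("12.4", "12.0"))  # False: framework needs more than the driver supports
```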
Framework ↔ CUDA compatibility quick reference:
Framework        | Minimum CUDA | Recommended CUDA
PyTorch 2.3+     | 11.8         | 12.1 – 12.4
TensorFlow 2.16+ | 12.3         | 12.3 – 12.4
JAX (latest)     | 12.0         | 12.4+
vLLM 0.4+        | 12.1         | 12.4

Always check the framework's official docs for the exact compatibility matrix before selecting a CUDA version.

Available CUDA Versions on Spheron

CUDA Version | NVIDIA Driver      | Notes
12.0         | 525+               | Maximum compatibility with older frameworks
12.4         | 550+               | Stable, broadly compatible; good default
12.6         | 560+               | Optimized for RTX 5090, H100, newer GPUs
12.8 Open    | 570+ (open-source) | Open-source kernel module, community use
13.0 Open    | 575+ (open-source) | Latest features; early adoption and research use

Open-source drivers (12.8 Open, 13.0 Open) use NVIDIA's open-source kernel module instead of the proprietary one. They are functionally equivalent for most AI/ML workloads and are preferred in community and research environments.

Choosing a Driver Version at Deployment

When deploying an instance on Spheron, the CUDA version and driver are selected via the OS image dropdown; they are bundled together.

Step-by-step

  1. Go to app.spheron.ai → Deploy
  2. Select your GPU
  3. Open the OS / Environment dropdown
  4. Choose an image that includes your desired CUDA version:
Goal                                | Recommended Image
Stable AI/ML work                   | Ubuntu 22.04 + CUDA 12.4 or Ubuntu 24.04 ML PyTorch
Latest GPU support (H100, RTX 5090) | Ubuntu 24.04 + CUDA 12.6
Open-source driver preference       | Ubuntu 22.04 + CUDA 12.8 Open
Research and early adoption         | Ubuntu 24.04 + CUDA 13.0 Open
Legacy framework compatibility      | Ubuntu 20.04 + CUDA 12.0
  5. Deploy; the instance is ready in 30–60 seconds with the driver already loaded

You cannot change the CUDA version after deployment. If you need a different version, deploy a new instance with the correct image.

Verify After Deployment

Once connected via SSH, confirm the environment is set up correctly:

# Driver version and max supported CUDA
nvidia-smi
 
# CUDA toolkit version (compiler)
nvcc --version
 
# Installed CUDA directories
ls /usr/local/ | grep cuda
 
# Quick Python check (PyTorch)
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('CUDA version:', torch.version.cuda)"
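
The manual checks above can also be wrapped in a short script that tolerates missing tools, which is handy when you are not yet sure whether an instance has a GPU at all. A sketch (the function name is illustrative):

```python
import shutil
import subprocess

def gpu_environment_report() -> dict:
    """Run each verification tool if present; record None for tools that are missing."""
    report = {}
    for tool, args in (("nvidia-smi", []), ("nvcc", ["--version"])):
        if shutil.which(tool) is None:
            report[tool] = None  # not installed (e.g. a CPU-only node, or toolkit absent)
        else:
            out = subprocess.run([tool, *args], capture_output=True, text=True)
            report[tool] = out.stdout.strip()
    return report

print(gpu_environment_report())
```

A None value for nvidia-smi points at the first troubleshooting case below; a None for nvcc alone points at the missing-toolkit case.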

Troubleshooting

nvidia-smi: command not found
The instance may have launched on a CPU-only node or the driver failed to load. Redeploy with a GPU image.

CUDA not available in PyTorch/TensorFlow
The framework's CUDA build doesn't match the installed runtime. Reinstall the framework with the correct CUDA wheel:

# PyTorch example - match cu124 to your CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu124

nvcc: command not found but nvidia-smi works
The CUDA toolkit (compiler) isn't installed; only the driver and runtime are present. Install it (this assumes NVIDIA's CUDA apt repository is configured on the image):

apt-get update && apt-get install -y cuda-toolkit-12-4

If nvcc still isn't found afterward, add the toolkit's bin directory (typically /usr/local/cuda-12.4/bin) to your PATH.

Version mismatch between nvidia-smi and nvcc
This is expected behavior: nvidia-smi shows the driver's maximum supported CUDA, while nvcc shows the installed toolkit version. Both are valid as long as the toolkit version is ≤ the driver's max CUDA.
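
To confirm that constraint from a script, extract the toolkit version from the nvcc output and compare it to the driver's max CUDA. A sketch using sample strings in the shape of typical tool output (the exact strings are assumptions, not output from a specific install):

```python
import re

nvcc_output = "Cuda compilation tools, release 12.4, V12.4.131"  # sample nvcc --version line
driver_max_cuda = "12.4"  # from the nvidia-smi banner

# Toolkit version is the number after "release" in the nvcc output
toolkit = re.search(r"release\s+([\d.]+)", nvcc_output).group(1)

as_tuple = lambda v: tuple(int(x) for x in v.split("."))
print(as_tuple(toolkit) <= as_tuple(driver_max_cuda))  # True: this pairing is valid
```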

Additional Resources