vLLM Inference Server
Deploy an OpenAI-compatible inference server using vLLM on Spheron H100 or A100 instances.
Recommended Hardware
| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B–13B | H100 80GB (1×) | Dedicated | Single GPU, fastest throughput |
| 30B+ | A100 80GB (2×) | Dedicated | Use `--tensor-parallel-size 2` |
| 70B+ | H100 NVLink (4× or 8×) | Cluster | NVLink offers give the best inter-GPU bandwidth |
For multi-GPU deployments, select an offer with `interconnectType: "NVLink"` for maximum tensor-parallel performance.
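As a sketch of how you might pick a suitable offer programmatically: the snippet below filters a list of offers for NVLink interconnect and a minimum GPU count, then takes the cheapest match. The field names `gpuCount` and `pricePerHour` are hypothetical illustrations of an offers listing; only `interconnectType` comes from the text above.

```python
# Sketch: choosing a multi-GPU offer with NVLink interconnect.
# Field names other than interconnectType are hypothetical.

def pick_nvlink_offer(offers, min_gpus=4):
    """Return the cheapest offer with NVLink and at least min_gpus GPUs, or None."""
    candidates = [
        o for o in offers
        if o.get("interconnectType") == "NVLink" and o.get("gpuCount", 0) >= min_gpus
    ]
    return min(candidates, key=lambda o: o["pricePerHour"]) if candidates else None

offers = [
    {"gpuCount": 8, "interconnectType": "NVLink", "pricePerHour": 18.0},
    {"gpuCount": 4, "interconnectType": "PCIe",   "pricePerHour": 7.5},
    {"gpuCount": 4, "interconnectType": "NVLink", "pricePerHour": 9.0},
]
print(pick_nvlink_offer(offers))  # the 4× NVLink offer at 9.0/hr
```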
Cloud-Init Startup Script
Paste this into the Startup Script field when deploying your instance. It installs vLLM, starts the server on port 8000, and creates a systemd service for persistence across reboots.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - |
    cat > /etc/systemd/system/vllm.service << 'EOF'
    [Unit]
    Description=vLLM Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3-8B-Instruct \
      --tensor-parallel-size 1 \
      --port 8000 \
      --gpu-memory-utilization 0.9
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm
  - systemctl start vllm
```

Replace `meta-llama/Llama-3-8B-Instruct` with your target model and adjust `--tensor-parallel-size` to match the number of GPUs on your instance.
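Once the service is up, a quick way to confirm the server is healthy is to query `/v1/models` and check that your model ID appears. The sketch below parses the OpenAI-style response shape (`{"data": [{"id": ...}]}`); the `sample` payload is an illustration, and the commented-out `urlopen` call shows how you would fetch the real response on the instance.

```python
import json
from urllib.request import urlopen  # stdlib only

def served_model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

# On the instance (or through an SSH tunnel), fetch the live response:
# payload = json.load(urlopen("http://localhost:8000/v1/models"))

# Illustrative sample of the expected response shape:
sample = {"object": "list", "data": [{"id": "meta-llama/Llama-3-8B-Instruct", "object": "model"}]}
print(served_model_ids(sample))  # ['meta-llama/Llama-3-8B-Instruct']
```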
Accessing the Server
SSH Tunnel (Recommended)

```bash
ssh -L 8000:localhost:8000 <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances). Keep this terminal open, then test from another terminal on your machine:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B-Instruct",
    "prompt": "Hello, world!",
    "max_tokens": 50
  }'
```

List Available Models
```bash
curl http://localhost:8000/v1/models
```

OpenAI SDK Compatibility
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require a key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain GPU parallelism briefly."}],
)
print(response.choices[0].message.content)
```

Performance Tuning
| Flag | Description | Recommended Value |
|---|---|---|
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | Match GPU count |
| `--gpu-memory-utilization` | Fraction of GPU VRAM to reserve for model weights and KV cache | 0.9 |
| `--max-model-len` | Maximum sequence length | Reduce if you hit OOM errors |
| `--dtype` | Model weight precision | `bfloat16` (for FP8 on H100, use `--quantization fp8`) |
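The rough arithmetic behind choosing these values: weights take about params × bytes-per-param, split evenly across GPUs under tensor parallelism, and whatever remains of the `--gpu-memory-utilization` budget goes to the KV cache. The sketch below estimates both; it is a back-of-the-envelope check (it ignores activation memory and framework overhead), not an exact accounting.

```python
def weights_per_gpu_gb(params_b: float, bytes_per_param: float, tp: int) -> float:
    """Approximate model-weight memory per GPU (GB) under tensor parallelism."""
    return params_b * bytes_per_param / tp

def kv_cache_budget_gb(vram_gb: float, util: float, weights_gb: float) -> float:
    """VRAM left for KV cache after weights, given --gpu-memory-utilization."""
    return vram_gb * util - weights_gb

# 70B model in bfloat16 (2 bytes/param) on 2x A100 80GB at 0.9 utilization:
w = weights_per_gpu_gb(70, 2.0, tp=2)   # 70 GB of weights per GPU
kv = kv_cache_budget_gb(80, 0.9, w)     # ~2 GB left for KV cache -> tight
print(f"weights/GPU: {w:.0f} GB, KV budget: {kv:.0f} GB")
```

A small KV-cache budget like this is the signal to lower `--max-model-len`, add GPUs, or drop to a lower-precision format.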
Example for 2× A100 with BF16 and large context:

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --dtype bfloat16 \
  --max-model-len 32768
```

Monitoring
Check server logs:

```bash
journalctl -u vllm -f
```

Watch GPU utilization:

```bash
nvidia-smi dmon -s u
```

Additional Resources
- Templates & Images: Copy-ready startup scripts
- Ollama + Open WebUI: Alternative for interactive local usage
- Networking: Dedicated IP, port access, and SSH tunneling
- Cost Optimization: GPU tier selection for inference workloads