vLLM Inference Server
Deploy an OpenAI-compatible inference server using vLLM on Spheron H100 or A100 instances.
Recommended Hardware
| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B–13B | H100 80GB (1×) | Dedicated | Single GPU, fastest throughput |
| 30B+ | A100 80GB (2×) | Dedicated | Use `--tensor-parallel-size 2` |
| 70B+ | H100 NVLink (4× or 8×) | Cluster | NVLink offers give the best inter-GPU bandwidth |
For multi-GPU deployments, select an offer with `interconnectType: "NVLink"` for maximum tensor-parallel performance.
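As a sketch of how you might pick a suitable offer programmatically: the snippet below filters a list of offers for NVLink interconnect and a minimum GPU count, then takes the cheapest match. The field names `gpuCount` and `pricePerHour` are hypothetical illustrations of an offers listing; only `interconnectType` comes from the text above.

```python
# Sketch: choosing a multi-GPU offer with NVLink interconnect.
# Field names other than interconnectType are hypothetical.

def pick_nvlink_offer(offers, min_gpus=4):
    """Return the cheapest offer with NVLink and at least min_gpus GPUs, or None."""
    candidates = [
        o for o in offers
        if o.get("interconnectType") == "NVLink" and o.get("gpuCount", 0) >= min_gpus
    ]
    return min(candidates, key=lambda o: o["pricePerHour"]) if candidates else None

offers = [
    {"gpuCount": 8, "interconnectType": "NVLink", "pricePerHour": 18.0},
    {"gpuCount": 4, "interconnectType": "PCIe",   "pricePerHour": 7.5},
    {"gpuCount": 4, "interconnectType": "NVLink", "pricePerHour": 9.0},
]
print(pick_nvlink_offer(offers))  # the 4× NVLink offer at 9.0/hr
```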
Cloud-Init Startup Script
Paste this into the Startup Script field when deploying your instance. It installs vLLM, starts the server on port 8000, and creates a systemd service for persistence across reboots.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install vllm
  - |
    cat > /etc/systemd/system/vllm.service << 'EOF'
    [Unit]
    Description=vLLM Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3-8B-Instruct \
      --tensor-parallel-size 1 \
      --port 8000 \
      --gpu-memory-utilization 0.9
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable vllm
  - systemctl start vllm
```

Replace `meta-llama/Llama-3-8B-Instruct` with your target model and adjust `--tensor-parallel-size` to match the number of GPUs on your instance.
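Once the service is up, a quick way to confirm the server is healthy is to query `/v1/models` and check that your model ID appears. The sketch below parses the OpenAI-style response shape (`{"data": [{"id": ...}]}`); the `sample` payload is an illustration, and the commented-out `urlopen` call shows how you would fetch the real response on the instance.

```python
import json
from urllib.request import urlopen  # stdlib only

def served_model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

# On the instance (or through an SSH tunnel), fetch the live response:
# payload = json.load(urlopen("http://localhost:8000/v1/models"))

# Illustrative sample of the expected response shape:
sample = {"object": "list", "data": [{"id": "meta-llama/Llama-3-8B-Instruct", "object": "model"}]}
print(served_model_ids(sample))  # ['meta-llama/Llama-3-8B-Instruct']
```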
Accessing the Server
SSH Tunnel (Recommended)

```bash
ssh -L 8000:localhost:8000 <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances). Keep this terminal open, then test from another terminal on your machine:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B-Instruct",
    "prompt": "Hello, world!",
    "max_tokens": 50
  }'
```

List Available Models
```bash
curl http://localhost:8000/v1/models
```

OpenAI SDK Compatibility
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require a key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain GPU parallelism briefly."}],
)
print(response.choices[0].message.content)
```

Performance Tuning
| Flag | Description | Recommended Value |
|---|---|---|
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | Match GPU count |
| `--gpu-memory-utilization` | Fraction of GPU VRAM to reserve for model weights and KV cache | 0.9 |
| `--max-model-len` | Maximum sequence length | Reduce if you hit OOM errors |
| `--dtype` | Model weight precision | `bfloat16` (for FP8 on H100, use `--quantization fp8`) |
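The rough arithmetic behind choosing these values: weights take about params × bytes-per-param, split evenly across GPUs under tensor parallelism, and whatever remains of the `--gpu-memory-utilization` budget goes to the KV cache. The sketch below estimates both; it is a back-of-the-envelope check (it ignores activation memory and framework overhead), not an exact accounting.

```python
def weights_per_gpu_gb(params_b: float, bytes_per_param: float, tp: int) -> float:
    """Approximate model-weight memory per GPU (GB) under tensor parallelism."""
    return params_b * bytes_per_param / tp

def kv_cache_budget_gb(vram_gb: float, util: float, weights_gb: float) -> float:
    """VRAM left for KV cache after weights, given --gpu-memory-utilization."""
    return vram_gb * util - weights_gb

# 70B model in bfloat16 (2 bytes/param) on 2x A100 80GB at 0.9 utilization:
w = weights_per_gpu_gb(70, 2.0, tp=2)   # 70 GB of weights per GPU
kv = kv_cache_budget_gb(80, 0.9, w)     # ~2 GB left for KV cache -> tight
print(f"weights/GPU: {w:.0f} GB, KV budget: {kv:.0f} GB")
```

A small KV-cache budget like this is the signal to lower `--max-model-len`, add GPUs, or drop to a lower-precision format.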
Example for 2× A100 with BF16 and large context:

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --dtype bfloat16 \
  --max-model-len 32768
```

Monitoring
Check server logs:

```bash
journalctl -u vllm -f
```

Watch GPU utilization:

```bash
nvidia-smi dmon -s u
```

Additional Resources
- Templates & Images: Copy-ready startup scripts
- Ollama + Open WebUI: Alternative for interactive local usage
- Networking: Dedicated IP, port access, and SSH tunneling
- Cost Optimization: GPU tier selection for inference workloads