# Ollama + Open WebUI
Run Ollama with Open WebUI on an RTX 4090 Spheron instance. Open WebUI provides a browser-based chat interface backed by any model Ollama can load into VRAM.
## Recommended Hardware
- GPU: RTX 4090 (24GB VRAM)
- Instance Type: Dedicated or Spot
- OS: Ubuntu 22.04 LTS
The RTX 4090 supports:
- Models up to ~13B parameters at Q4 quantization
- Models up to ~7B parameters in full precision (FP16)
For larger models (30B+), use an A100 or H100 instead.
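As a rule of thumb, a model's weight footprint is roughly parameters × bits-per-weight ÷ 8, plus headroom for the KV cache and activations. A minimal sketch of that arithmetic (the 15% overhead factor and the 4.5-effective-bit figure for Q4 are assumptions, not measured values):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.15) -> float:
    """Rough VRAM estimate in GB: weights plus ~15% for KV cache/activations.

    Q4 quantization is treated as ~4.5 effective bits per weight, since
    group scales and zero-points add a little on top of the 4-bit values.
    """
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead

print(round(estimate_vram_gb(7, 16, overhead=1.0), 1))  # 7B FP16, weights alone: 14.0 GB
print(round(estimate_vram_gb(13, 4.5), 1))              # 13B Q4 with headroom: 8.4 GB
```

Actual usage also grows with context length, so treat these numbers as lower bounds when planning against 24GB.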
## Cloud-Init Startup Script
Paste this into the Startup Script field when deploying. It installs Docker with the NVIDIA Container Toolkit, writes a docker-compose.yml, and starts both services.
```yaml
#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y ca-certificates curl gnupg
  - install -m 0755 -d /etc/apt/keyrings
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
  - chmod a+r /etc/apt/keyrings/docker.gpg
  - echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" > /etc/apt/sources.list.d/docker.list
  - apt-get update -y
  - apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' > /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update -y
  - apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=docker
  - systemctl restart docker
  - mkdir -p /opt/ollama
  - |
    cat > /opt/ollama/docker-compose.yml << 'EOF'
    services:
      ollama:
        image: ollama/ollama:latest
        container_name: ollama
        runtime: nvidia
        environment:
          - NVIDIA_VISIBLE_DEVICES=all
        volumes:
          - ollama_data:/root/.ollama
        ports:
          - "11434:11434"
        restart: unless-stopped
      open-webui:
        image: ghcr.io/open-webui/open-webui:main
        container_name: open-webui
        ports:
          - "3000:8080"
        environment:
          - OLLAMA_BASE_URL=http://ollama:11434
        volumes:
          - webui_data:/app/backend/data
        depends_on:
          - ollama
        restart: unless-stopped
    volumes:
      ollama_data:
      webui_data:
    EOF
  - chmod 644 /opt/ollama/docker-compose.yml
  - docker compose -f /opt/ollama/docker-compose.yml up -d
```

## Access the Web UI
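Once the instance boots (cloud-init can take a few minutes on first run), you can SSH in and confirm both containers are up and that Ollama can see the GPU. These are standard Docker commands; the container names match the compose file above:

```shell
# Both containers should report an "Up" status
docker ps --format '{{.Names}}: {{.Status}}'

# nvidia-smi run inside the container confirms GPU passthrough is working
docker exec ollama nvidia-smi
```

If `nvidia-smi` fails inside the container but works on the host, the NVIDIA Container Toolkit step likely did not complete; re-run `nvidia-ctk runtime configure --runtime=docker` and restart Docker.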
### SSH Tunnel (Recommended)
```shell
ssh -L 3000:localhost:3000 <user>@<ipAddress>
```

Replace `<user>` with the username shown in the instance details panel in the dashboard (e.g., `ubuntu` for Spheron AI instances). Then open http://localhost:3000 in your browser. On first launch, create an admin account.
## Pull and Use Models
### Pull a model
```shell
docker exec -it ollama ollama pull llama3.2
```

Popular models for the RTX 4090:

```shell
docker exec -it ollama ollama pull llama3.2      # 3B, fast
docker exec -it ollama ollama pull llama3.1:8b   # 8B, good balance
docker exec -it ollama ollama pull mistral:7b    # 7B, good quality
docker exec -it ollama ollama pull codellama:13b # 13B coding model
```

### CLI usage
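To check which models are on disk, and which are currently loaded into VRAM, Ollama's own subcommands can be run the same way:

```shell
docker exec -it ollama ollama list  # models downloaded to the volume
docker exec -it ollama ollama ps    # models currently loaded, with memory use
```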
```shell
docker exec -it ollama ollama run llama3.2
```

### API usage
Ollama serves an HTTP API on port 11434: its native endpoints under `/api`, plus an OpenAI-compatible API under `/v1`. The port is published to the host, so you can tunnel it the same way:
```shell
ssh -L 11434:localhost:11434 <user>@<ipAddress>
```

Then query it:
```shell
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
```

## Memory Guidelines for RTX 4090 (24GB VRAM)
| Model | Quantization | VRAM | Fits on 4090? |
|---|---|---|---|
| 7B | FP16 | ~14GB | Yes |
| 7B | Q4 | ~4GB | Yes |
| 13B | Q4 | ~8GB | Yes |
| 30B | Q4 | ~20GB | Yes (tight) |
| 70B | Q4 | ~40GB | No; use A100/H100 |
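The generate endpoint shown earlier can also be called programmatically through the same 11434 tunnel. A minimal Python sketch using only the standard library (it assumes the tunnel is up and `llama3.2` has been pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumes the SSH tunnel from above


def build_payload(prompt: str, model: str = "llama3.2") -> bytes:
    """Encode a non-streaming /api/generate request body."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()


def generate(prompt: str, model: str = "llama3.2") -> str:
    """POST to /api/generate and return the model's text response."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Call it with e.g. `generate("Why is the sky blue?")`. With `"stream": false`, the full answer arrives in one JSON object whose `response` field holds the text; for streaming output, drop that flag and read the response line by line.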
## Additional Resources
- vLLM Inference Server: OpenAI-compatible API for production workloads
- Templates & Images: Additional startup script templates
- Networking: SSH tunneling and port access
- Instance Types: Choosing the right GPU for your model