LLM & AI Guides
Guides for running language model and AI inference workloads on Spheron GPU instances, from interactive chat interfaces to high-throughput OpenAI-compatible API servers.
Choosing the Right Instance for Inference
| Workload | Recommended Type | Why |
|---|---|---|
| Interactive chat, testing | Spot (RTX 4090) | Cost-effective for low-traffic usage |
| Production API (7B–13B) | Dedicated (H100 80GB) | Consistent latency, single-GPU throughput |
| Large models (30B+) | Dedicated (2× A100 80GB) | Multi-GPU tensor parallelism |
| 70B+ models | Cluster (H100 NVLink) | NVLink bandwidth for maximum throughput |
Use Spot instances for experiments and development; switch to Dedicated for production traffic.
Available Guides
vLLM Inference Server
OpenAI-compatible inference server using vLLM on H100 or A100. Includes a systemd service for persistence across reboots, SSH tunnel access, and performance tuning flags (--tensor-parallel-size, --dtype, --max-model-len).
Best for: Production API workloads; drop-in replacement for the OpenAI API.
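As a minimal sketch of the flags named above (the model name, port, and flag values here are illustrative placeholders, not taken from the guide), launching vLLM's OpenAI-compatible server might look like:

```shell
# Sketch: start vLLM's OpenAI-compatible API server (values are illustrative).
# --tensor-parallel-size shards the model across GPUs (2 here, e.g. 2x A100),
# --dtype picks compute precision, --max-model-len caps the context window.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 8192
```

Once it is up, any OpenAI client pointed at `http://localhost:8000/v1` (for example through an SSH tunnel) can call it as a drop-in replacement.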
Ollama + Open WebUI
Browser-based chat interface backed by Ollama on an RTX 4090. Docker Compose setup with NVIDIA GPU passthrough; pull any model with a single command.
Best for: Interactive local model usage; exploring models without writing code.
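A rough equivalent of that setup as plain `docker run` commands (the guide itself uses Docker Compose; model name and host ports here are illustrative):

```shell
# Sketch: Ollama with NVIDIA GPU passthrough, then Open WebUI in front of it.
docker run -d --gpus=all --name ollama \
  -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

# Pull any model with a single command (llama3 is just an example).
docker exec ollama ollama pull llama3

# Open WebUI on port 3000, reaching Ollama through the host gateway.
docker run -d --name open-webui -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
```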
Qwen3-Omni-30B-A3B
Multimodal MoE language model (30B total parameters, ~3B active per token) supporting text, audio, image, and video inputs. 256K context window; deploys on A100 or H100.
Best for: Multimodal tasks requiring audio, vision, and text processing in a single model.
Qwen3-VL 4B & 8B
Vision-language models available in 4B and 8B parameter variants. 256K context, multimodal reasoning, and GUI automation capabilities on RTX 4090 or A100.
Best for: Image understanding, visual reasoning, and GUI automation tasks.
Chandra OCR
Specialized OCR model for document processing with 83.1% accuracy, outperforming GPT-4o on document tasks. Supports vLLM deployment for high-throughput document pipelines.
Best for: Document digitization, text extraction, and OCR pipelines.
SoulX-Podcast-1.7B
Multi-speaker podcast generation model (1.7B parameters). Produces 60+ minute dialogues with speaker switching, zero-shot voice cloning, and paralinguistic cues.
Best for: Audio content generation, podcast production, and voice synthesis.
Janus CoderV-8B
8B multimodal code intelligence model. Generates HTML/CSS/React from screenshots, charts, and mockups. Trained on JANUSCODE-800K, the largest multimodal code dataset.
Best for: Visual-to-code translation, layout bug fixing, and UI mockup generation.
Baidu ERNIE-4.5-VL-28B-A3B
Advanced vision-language MoE model from Baidu with 28B total parameters and roughly 3B activated per token. Strong visual reasoning and STEM task performance on RTX 4090 or A6000.
Best for: Visual reasoning, multimodal understanding, and STEM-domain tasks.
Additional Resources
- Instance Types: Spot vs Dedicated vs Cluster
- Networking: SSH tunneling and port access
- Cost Optimization: Reducing inference costs with Spot instances
- Templates & Images: Copy-ready startup scripts
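As a quick illustration of the SSH tunneling covered in the networking guide (the hostname and port are placeholders for your own instance and service):

```shell
# Sketch: forward remote port 8000 (e.g. a vLLM server on the instance)
# to localhost:8000 on your machine. -N opens the tunnel without a shell.
ssh -N -L 8000:localhost:8000 user@instance-host
```

With the tunnel open, the instance's API is reachable locally, e.g. `curl http://localhost:8000/v1/models`.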