LLM & AI Guides
Guides for running language model and AI inference workloads on Spheron GPU instances, from interactive chat interfaces to high-throughput OpenAI-compatible API servers.
Choosing the Right Instance for Inference
| Workload | Recommended Type | Why |
|---|---|---|
| Interactive chat, testing | Spot (RTX 4090) | Cost-effective for low-traffic usage |
| Production API (7B–13B) | Dedicated (H100 80GB) | Consistent latency, single-GPU throughput |
| Large models (30B+) | Dedicated (2× A100 80GB) | Multi-GPU tensor parallelism |
| 70B+ models | Cluster (H100 NVLink) | NVLink bandwidth for maximum throughput |
Use Spot instances for experiments and development; switch to Dedicated for production traffic.
Available Guides
vLLM Inference Server
OpenAI-compatible inference server using vLLM on H100 or A100. Includes a systemd service for persistence across reboots, SSH tunnel access, and performance tuning flags (--tensor-parallel-size, --dtype, --max-model-len).
Best for: Production API workloads; drop-in replacement for the OpenAI API.
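As a minimal sketch of the flags named above (the model name, port, and flag values here are illustrative placeholders, not taken from the guide), launching vLLM's OpenAI-compatible server might look like:

```shell
# Sketch: start vLLM's OpenAI-compatible API server (values are illustrative).
# --tensor-parallel-size shards the model across GPUs (2 here, e.g. 2x A100),
# --dtype picks compute precision, --max-model-len caps the context window.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 8192
```

Once it is up, any OpenAI client pointed at `http://localhost:8000/v1` (for example through an SSH tunnel) can call it as a drop-in replacement.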
Ollama + Open WebUI
Browser-based chat interface backed by Ollama on an RTX 4090. Docker Compose setup with NVIDIA GPU passthrough; pull any model with a single command.
Best for: Interactive local model usage; exploring models without writing code.
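A rough equivalent of that setup as plain `docker run` commands (the guide itself uses Docker Compose; model name and host ports here are illustrative):

```shell
# Sketch: Ollama with NVIDIA GPU passthrough, then Open WebUI in front of it.
docker run -d --gpus=all --name ollama \
  -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

# Pull any model with a single command (llama3 is just an example).
docker exec ollama ollama pull llama3

# Open WebUI on port 3000, reaching Ollama through the host gateway.
docker run -d --name open-webui -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
```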
Qwen3-Omni-30B-A3B
Multimodal MoE language model (30B total parameters, ~3B active per token) supporting text, audio, image, and video inputs. 256K context window; deploys on A100 or H100.
Best for: Multimodal tasks requiring audio, vision, and text processing in a single model.
Qwen3-VL 4B & 8B
Vision-language models available in 4B and 8B parameter variants. 256K context, multimodal reasoning, and GUI automation capabilities on RTX 4090 or A100.
Best for: Image understanding, visual reasoning, and GUI automation tasks.
Chandra OCR
Specialized OCR model for document processing with 83.1% accuracy, outperforming GPT-4o on document tasks. Supports vLLM deployment for high-throughput document pipelines.
Best for: Document digitization, text extraction, and OCR pipelines.
SoulX-Podcast-1.7B
Multi-speaker podcast generation model (1.7B parameters). Produces 60+ minute dialogues with speaker switching, zero-shot voice cloning, and paralinguistic cues.
Best for: Audio content generation, podcast production, and voice synthesis.
Janus CoderV-8B
8B multimodal code intelligence model. Generates HTML/CSS/React from screenshots, charts, and mockups. Trained on JANUSCODE-800K, the largest multimodal code dataset.
Best for: Visual-to-code translation, layout bug fixing, and UI mockup generation.
Baidu ERNIE-4.5-VL-28B-A3B
Advanced vision-language MoE model from Baidu with 28B total parameters and roughly 3B activated per token. Strong visual reasoning and STEM task performance on RTX 4090 or A6000.
Best for: Visual reasoning, multimodal understanding, and STEM-domain tasks.
Additional Resources
- Instance Types: Spot vs Dedicated vs Cluster
- Networking: SSH tunneling and port access
- Cost Optimization: Reducing inference costs with Spot instances
- Templates & Images: Copy-ready startup scripts
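As a quick illustration of the SSH tunneling covered in the networking guide (the hostname and port are placeholders for your own instance and service):

```shell
# Sketch: forward remote port 8000 (e.g. a vLLM server on the instance)
# to localhost:8000 on your machine. -N opens the tunnel without a shell.
ssh -N -L 8000:localhost:8000 user@instance-host
```

With the tunnel open, the instance's API is reachable locally, e.g. `curl http://localhost:8000/v1/models`.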