
Qwen3-Omni-30B-A3B

Multimodal language model with 30B parameters. Processes text, audio, images, and video inputs for comprehensive multimodal understanding.

Note: This is Qwen3-Omni (audio-capable), different from Qwen3-VL (vision-language only).

Key Capabilities

  • Multimodal inputs - Text, audio, images, video
  • Audio understanding - Speech recognition, audio analysis
  • Vision-language - Image and video understanding
  • Long context - 256K native context window
  • Multilingual - 119+ languages and dialects

Use cases: Audio transcription, multimodal chat, content analysis, accessibility tools
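
To make the multimodal input format concrete, here is a sketch of a mixed audio + image + text request structured the way Qwen chat templates expect. The image and text entries match the test.py example later in this guide; the "audio" key is an assumption based on the general Qwen message convention, so check the model card for the exact schema:

# Hypothetical multimodal message; the "audio" key is assumed from the Qwen
# message convention and may differ for your model / transformers version.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/clip.wav"},   # speech to transcribe (assumed key)
            {"type": "image", "image": "/path/to/frame.jpg"},  # image to describe
            {"type": "text", "text": "Transcribe the audio, then describe the image."},
        ],
    }
]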

Requirements

Hardware:
  • GPU: A100 or H100 (30B model needs significant VRAM)
  • VRAM: 24GB+ minimum, 40GB+ recommended
  • RAM: 32GB+
  • Storage: 60GB (SSD recommended)
Software:
  • Ubuntu 22.04 LTS
  • CUDA 12.1+
  • Python 3.11
  • Conda/Miniconda

Deploy on Spheron

  1. Sign up at app.spheron.ai
  2. Add credits (card/crypto)
  3. Deploy → A100 or H100 → Region → Ubuntu 22.04 → SSH key → Deploy
Connect:
ssh -i <private-key-path> root@<your-vm-ip>

New to Spheron? Getting Started | SSH Setup

Installation

Install Miniconda

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc

Create Environment

conda create -n qwen python=3.11 -y && conda activate qwen

Accept ToS if prompted:

conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

Install PyTorch (CUDA 12.1)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
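
Optionally run a quick Python check to confirm the CUDA build of PyTorch can see the GPU before installing the rest of the stack:

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")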

Install Dependencies

pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install einops timm pillow sentencepiece protobuf decord numpy requests
pip install bitsandbytes
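
Optionally confirm that the git build of transformers is recent enough to expose the Qwen3-VL classes used in the script below; an ImportError here means the install needs to be repeated:

import transformers, accelerate

print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)

# The classes used in test.py; an ImportError means the installed
# transformers build does not yet include Qwen3-VL support.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor  # noqa: F401
print("Qwen3-VL classes available")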

Create test.py

Create inference script:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
 
# Load the model on available devices
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Thinking",
    dtype="auto",
    device_map="auto"
)
 
# Optional: Enable flash_attention_2 for better performance and memory efficiency,
# especially in multi-image or video tasks.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen3-VL-4B-Thinking",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
 
# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Thinking")
 
# Define input messages (image + text prompt)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
 
# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
 
# Generate model output
generated_ids = model.generate(**inputs, max_new_tokens=128)
 
# Extract generated tokens (excluding prompt tokens)
generated_ids_trimmed = [
    output[len(input_ids):] for input_ids, output in zip(inputs.input_ids, generated_ids)
]
 
# Decode output text
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
 
print(output_text)

Run Script

conda activate qwen
python3 test.py

Configuration

Model Variants:
  • Change to Qwen3-VL-8B-Thinking for a larger model (needs more VRAM)
Precision:
  • dtype=torch.bfloat16 (preferred on A100/H100) or torch.float16
Flash Attention:
  • Add attn_implementation="flash_attention_2" (requires the flash-attn package)
Device:
  • device_map="auto" (default, recommended)
  • device_map={"":0} (pin the whole model to a single GPU)
Local Images:
  • Pass a local file path or a PIL.Image object in the "image" field instead of a URL (see the sketch below)
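
A minimal sketch combining these options, assuming the flash-attn package is installed and using a placeholder local image path:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# bfloat16 + FlashAttention 2, pinned to a single GPU
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Thinking",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn
    device_map={"":0},
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Thinking")

# Local image instead of a URL: pass a file path (or a PIL.Image object)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/local/image.jpg"},  # placeholder path
            {"type": "text", "text": "Describe this image."},
        ],
    }
]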

Troubleshooting

OOM:
  • Reduce max_new_tokens
  • Use dtype=torch.float16 or torch.bfloat16
  • Enable 4-bit quantization with bitsandbytes (see the sketch below)
Slow Loading:
  • Cache models locally
  • Use NVMe storage
  • Pass use_safetensors=True to from_pretrained
CUDA Errors:
  • Check the driver's CUDA version with nvidia-smi and install the matching torch build (cu121 here)
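
A minimal 4-bit quantization sketch using bitsandbytes to reduce VRAM use; this trades some quality for memory, and whether 4-bit weights are acceptable for your workload is something to verify yourself:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute to cut VRAM requirements
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Thinking",
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Thinking")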

Additional Resources