Qwen3-VL 4B & 8B
Vision-language models with advanced reasoning capabilities. They process text, images, and video with a 256K-token native context (scalable to 1M tokens).
Models:
- 4B: 4.83B parameters
- 8B-Instruct: 8.77B parameters
- 8B-Thinking: Enhanced reasoning variant
Training: 36 trillion tokens, 119 languages/dialects
Key Features
Architecture:
- Interleaved-MRoPE - interleaved multimodal rotary position embeddings for long-horizon video reasoning
- DeepStack - multi-level ViT feature fusion for fine-grained detail
- Text-Timestamp Alignment - precise event localization in videos
Capabilities:
- Visual agents (GUI automation, OSWorld, Android Control)
- Visual coding (mockups → HTML/CSS/JS, Draw.io diagrams)
- Spatial understanding (2D/3D grounding, position/viewpoint)
- OCR (32 languages; robust to low light, blur, and tilt; see the prompt sketch after this list)
Benchmarks:
- 8B-Thinking: MathVision 36.8 | MMMU 61.7 | MathVista 71.3
- 235B (flagship): beats Gemini 2.5 Pro and GPT-5 on agent, document, and spatial tasks
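For instance, the OCR capability above can be exercised with the same load-and-generate pipeline as the test.py script later in this guide by swapping only the message content. The image URL and prompt here are placeholders, not part of the official examples:

# OCR-style request: the loading/generation pipeline is identical to test.py
# later in this guide; only the message content changes.
# The image URL is a placeholder - point it at any document photo.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/receipt.jpg"},
            {"type": "text", "text": "Extract all text in this image, preserving reading order."},
        ],
    }
]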
Requirements
Hardware:
- GPU: RTX 4090, A6000, A100, or H100
- VRAM: 8GB minimum, 16GB+ recommended
- RAM: 16GB+
- Storage: 10GB+ (SSD recommended)
Software:
- Ubuntu 22.04 LTS
- CUDA 12.1+
- Python 3.11
- Conda/Miniconda
Note: FP8-quantized releases reduce VRAM requirements significantly (block size 128). A quick VRAM check is sketched below.
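Once PyTorch is installed (steps below), you can confirm how much VRAM the machine actually exposes before pulling the weights. This is a small sanity-check snippet, not part of the official setup:

import torch

# Report the detected GPU and its total memory so you can compare it
# against the 8GB minimum / 16GB+ recommended figures above.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA-capable GPU detected")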
Deploy on Spheron
- Sign up at app.spheron.ai
- Add credits (card/crypto)
- Deploy → RTX 4090/A100 → Region → Ubuntu 22.04 → SSH key → Deploy
Then SSH into the VM:
ssh -i <private-key-path> root@<your-vm-ip>

New to Spheron? Getting Started | SSH Setup
Installation
Install Miniconda
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc

Create Environment
conda create -n qwen python=3.11 -y && conda activate qwen

Accept ToS if prompted:
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

Install PyTorch (CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
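Before moving on, it is worth confirming that this build actually sees the GPU (a quick check; the reported CUDA runtime should be compatible with the driver version shown by nvidia-smi):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"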
Install Dependencies
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install einops timm pillow sentencepiece protobuf decord numpy requests
pip install bitsandbytes
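Optionally, verify that the source build of transformers exposes the Qwen3-VL classes used in the script below (a quick import check, not an official step):

python3 -c "from transformers import Qwen3VLForConditionalGeneration, AutoProcessor; print('ok')"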
Create test.py
Create the inference script:
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model on available devices
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Thinking",
    dtype="auto",
    device_map="auto"
)

# Optional: Enable flash_attention_2 for better performance and memory efficiency,
# especially in multi-image or video tasks.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen3-VL-4B-Thinking",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Thinking")

# Define input messages (image + text prompt)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate model output
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Extract generated tokens (excluding prompt tokens)
generated_ids_trimmed = [
    output[len(input_ids):] for input_ids, output in zip(inputs.input_ids, generated_ids)
]

# Decode output text
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)

Run Script
conda activate qwen
python3 test.py

Configuration
Model Variants:
- 4B: Qwen/Qwen3-VL-4B-Thinking
- 8B: Qwen/Qwen3-VL-8B-Thinking (needs more VRAM)
Precision:
- dtype=torch.float16 or torch.bfloat16 (A100/H100)
- Add attn_implementation="flash_attention_2" (if supported; see the sketch after this list)
Device Mapping:
- device_map="auto" (recommended)
- device_map={"":0} (single GPU)
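Putting those options together, loading the 8B Thinking variant in bfloat16 with FlashAttention-2 might look like the sketch below; it assumes an A100/H100-class GPU and that the separate flash-attn package is installed, neither of which is covered by the steps above:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# 8B variant in bfloat16 with FlashAttention-2 (requires the `flash-attn`
# package and an Ampere-or-newer GPU); device_map="auto" spreads layers
# across the available devices.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Thinking",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Thinking")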
Troubleshooting
OOM:
- Reduce max_new_tokens
- Use dtype=torch.float16
- Enable quantization with bitsandbytes (see the sketch after this list)
Slow loading:
- Cache models locally
- Use NVMe storage
- Enable use_safetensors=True
CUDA errors:
- Match torch and CUDA versions; check the driver with nvidia-smi
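If memory is still tight, 4-bit loading through bitsandbytes (installed earlier) is one option. This is a minimal sketch assuming Qwen3-VL works with the standard transformers quantization_config path; expect some quality loss versus bf16:

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization to reduce VRAM usage; compute runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Thinking",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Thinking")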
Additional Resources
- Qwen3-VL on HuggingFace
- Getting Started - Spheron deployment
- API Reference - Programmatic access