Baidu ERNIE-4.5-VL-28B-A3B-Thinking
Overview
A multimodal reasoning model with a "Thinking" mode, released under Apache 2.0. It uses a 28B-parameter Mixture-of-Experts (MoE) design with ~3B parameters active per token. Baidu reports performance competitive with GPT-5-High and Gemini-2.5-Pro on visual reasoning, STEM, chart analysis, and video understanding.
Released: November 10, 2025 by Baidu
Architecture: ERNIE-4.5-VL-28B-A3B + reasoning fine-tuning (GSPO, IcePop)
Training: Visual-language reasoning datasets with multimodal RL
Key Capabilities
- Visual Reasoning - Multi-step reasoning, chart analysis, causal relationships
- STEM Reasoning - Math, science, engineering from images
- Visual Grounding - Object localization, industrial QC/automation
- Dynamic Detail Focus - Zooms into regions, chain-of-thought over visuals
- Tool Calling - Image search, cropping, web lookup integration
- Video Understanding - Temporal awareness, event localization, frame tracking
Use cases: Multimodal agents, document automation, visual search, education, video analysis
Requirements
Hardware:
- GPU: RTX 4090 or RTX A6000
- VRAM: 24GB+ recommended
- RAM: 16GB+
- Storage: 20GB free

Software:
- Ubuntu 22.04 LTS
- CUDA 12.1+
- Python 3.11
- Conda/Miniconda
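
Once the instance is running (see the next section), a quick optional way to confirm the GPU and CUDA driver meet these requirements:

```bash
# Should show an RTX 4090 / A6000 with 24GB VRAM and a CUDA 12.x driver
nvidia-smi
```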
Deploy on Spheron
- Sign up at app.spheron.ai
- Add credits (card/crypto)
- Deploy → Select RTX 4090 → Region → Ubuntu 22.04 → SSH key → Deploy
Instance ready in 60 seconds.
Connect:
```bash
ssh -i <private-key-path> root@<your-vm-ip>
```
New to Spheron? See Getting Started | SSH Setup
Installation
Install Miniconda
```bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc
```
Create Python Environment
```bash
conda create -n ernie python=3.11 -y && conda activate ernie
```
Install Dependencies
```bash
pip install torch torchvision torchaudio einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
```
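Since transformers, accelerate, and diffusers track their git main branches here, versions drift between installs; a quick optional check that the core stack imports and sees the GPU:

```bash
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"
```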
Install Jupyter
```bash
conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root --no-browser
```
Access Jupyter (Local Machine)
SSH port forwarding from your local machine:
```bash
ssh -L 8888:localhost:8888 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>
```
Then copy the Jupyter URL (including its token) from the server terminal into your browser.
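
Jupyter exits when the SSH session that launched it closes. An optional workaround using standard Ubuntu tooling is to run it in the background:

```bash
# Keep Jupyter alive after the launching shell exits; output lands in nohup.out
nohup jupyter notebook --allow-root --no-browser --port 8888 &

# The login URL with its token appears in the log
tail -f nohup.out
```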
Run Model
Load Model
Open a notebook and run:
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# bfloat16 weights; device_map="auto" lets accelerate spread layers
# across the GPU and, if needed, CPU memory
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Register the processor's image preprocessing with the model
model.add_image_preprocess(processor)
```
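In bfloat16 the 28B weights alone are about 56 GB (28B parameters × 2 bytes), so a single 24 GB card leans on CPU offload and will be slow. bitsandbytes is already installed above; a minimal 4-bit sketch, assuming the ERNIE remote code tolerates quantized loading (an untested assumption):

```python
from transformers import BitsAndBytesConfig

# Assumption: the custom ERNIE-4.5-VL classes accept quantization_config;
# if loading fails, fall back to the bf16 + offload path above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
```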
Run Your First Inference
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in the image, and what color is the dog?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://images.pexels.com/photos/58997/pexels-photo-58997.jpeg"
                }
            },
        ]
    },
]

# Render the chat template, then run the vision preprocessing
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move all input tensors to the model's device
device = next(model.parameters()).device
inputs = inputs.to(device)

generated_ids = model.generate(
    inputs=inputs['input_ids'],
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)

# Decode only the newly generated tokens, skipping the prompt
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
```
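Video understanding goes through the same pipeline; `process_vision_info` already returns a separate `video_inputs` list, and decord (installed earlier) is presumably there for frame decoding. A minimal sketch, assuming a `video_url` content type that mirrors `image_url` (the URL is a placeholder, and the exact key should be verified against the model card):

```python
# Assumption: "video_url" mirrors the "image_url" schema above
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the main events in this clip in order."},
            {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},  # placeholder URL
        ]
    },
]

text = processor.tokenizer.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(video_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(next(model.parameters()).device)

generated_ids = model.generate(
    inputs=inputs['input_ids'],
    **inputs,
    max_new_tokens=1024,
    use_cache=False,
)
print(processor.decode(generated_ids[0][len(inputs['input_ids'][0]):]))
```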
Additional Resources
- Model on HuggingFace
- Getting Started - Spheron deployment basics
- API Reference - Programmatic deployment
Performance: Baidu reports parity with GPT-5-High and Gemini-2.5-Pro on chart analysis, document understanding, video reasoning, and STEM tasks.