Baidu ERNIE-4.5-VL-28B-A3B-Thinking

Overview

ERNIE-4.5-VL-28B-A3B-Thinking is a multimodal reasoning model with a dedicated "Thinking" mode, released under the Apache 2.0 license. It has 28B total parameters in a Mixture-of-Experts (MoE) design, with roughly 3B parameters active per token. Baidu reports performance competitive with GPT-5-High and Gemini-2.5-Pro on visual reasoning, STEM, chart analysis, and video understanding.

Released: November 10, 2025 by Baidu
Architecture: ERNIE-4.5-VL-28B-A3B + reasoning fine-tuning (GSPO, IcePop)
Training: Visual-language reasoning datasets with multimodal RL

Key Capabilities

  • Visual Reasoning - Multi-step reasoning, chart analysis, causal relationships
  • STEM Reasoning - Math, science, engineering from images
  • Visual Grounding - Object localization, industrial QC/automation
  • Dynamic Detail Focus - Zooms into regions, chain-of-thought over visuals
  • Tool Calling - Image search, cropping, web lookup integration
  • Video Understanding - Temporal awareness, event localization, frame tracking

Use cases: Multimodal agents, document automation, visual search, education, video analysis

Requirements

Hardware:
  • GPU: RTX 4090 or RTX A6000
  • RAM: 16GB+
  • Storage: 20GB free
  • VRAM: 24GB+ recommended
Software:
  • Ubuntu 22.04 LTS
  • CUDA 12.1+
  • Python 3.11
  • Conda/Miniconda
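
If you want to sanity-check a host against the requirements above, a minimal sketch using only the Python standard library (assuming nvidia-smi is on the PATH) might look like this; the thresholds simply mirror the list:

import shutil
import subprocess

# GPU name and VRAM, as reported by the NVIDIA driver
gpu = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"GPU: {gpu}")  # expect e.g. "NVIDIA GeForce RTX 4090, 24564 MiB"

# System RAM from /proc/meminfo (first line is MemTotal, in kB)
with open("/proc/meminfo") as f:
    mem_kb = int(f.readline().split()[1])
print(f"RAM: {mem_kb / 1024**2:.1f} GB (want 16+)")

# Free disk space for model weights
free_gb = shutil.disk_usage("/").free / 1024**3
print(f"Free disk: {free_gb:.1f} GB (want 20+)")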

Deploy on Spheron

  1. Sign up at app.spheron.ai
  2. Add credits (card/crypto)
  3. Deploy → Select RTX 4090 → Region → Ubuntu 22.04 → SSH key → Deploy

Your instance should be ready in about 60 seconds.

Connect:
ssh -i <private-key-path> root@<your-vm-ip>

New to Spheron? See Getting Started | SSH Setup

Installation

Install Miniconda

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc

Create Python Environment

conda create -n ernie python=3.11 -y && conda activate ernie

Install Dependencies

pip install torch torchvision torchaudio einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
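
Before downloading the weights, it is worth confirming that the install above picked up CUDA and bfloat16 support. A quick check:

import torch

print(torch.__version__)                # installed PyTorch version
print(torch.cuda.is_available())        # True if the CUDA build sees the driver
print(torch.cuda.get_device_name(0))    # e.g. "NVIDIA GeForce RTX 4090"
print(torch.cuda.is_bf16_supported())   # bfloat16 is used in the load step below
vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{vram:.0f} GB VRAM")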

Install Jupyter

conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root --no-browser --port 8888

Access Jupyter (Local Machine)

SSH port forwarding from your local machine:

ssh -L 8888:localhost:8888 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>

Copy the URL (including the access token) that Jupyter prints in the server terminal into your local browser.

Run Model

Load Model

Open notebook and run:

import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# trust_remote_code is required: the MoE/vision model code ships with the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",        # spread layers across GPU VRAM and CPU RAM as needed
    dtype=torch.bfloat16,
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)  # wire the image preprocessor into the model
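
If you are tight on VRAM, bitsandbytes (installed earlier) can load the weights in 4-bit instead of bfloat16. This is a minimal sketch of the standard transformers quantization path; whether it interacts cleanly with this checkpoint's custom remote code is an assumption you should verify on your hardware:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with bfloat16 compute; roughly quarters weight memory vs bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    'baidu/ERNIE-4.5-VL-28B-A3B-Thinking',
    quantization_config=bnb_config,   # assumption: the remote code tolerates 4-bit loading
    device_map="auto",
    trust_remote_code=True,
)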

Run Your First Inference

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in the image, and what color is the dog?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://images.pexels.com/photos/58997/pexels-photo-58997.jpeg"
                }
            },
        ]
    },
]

# Render the chat template into a prompt string (tokenization happens later)
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Fetch and decode the image (and any video) referenced in the messages
image_inputs, video_inputs = processor.process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

device = next(model.parameters()).device
inputs = inputs.to(device)

# input_ids is passed explicitly alongside the full processor outputs
# (pixel values etc.), matching the checkpoint's usage pattern
generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)

# Drop the prompt tokens and decode only the newly generated text
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
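
Video input follows the same pipeline. The sketch below assumes the processor accepts a video_url content entry analogous to image_url (decord, installed earlier, handles frame decoding); the clip URL is a placeholder to replace with your own:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the key events in this clip in order."},
            # hypothetical URL; substitute a real video file
            {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
        ]
    },
]

# Same steps as the image example: template -> vision info -> processor -> generate
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(next(model.parameters()).device)

generated_ids = model.generate(inputs=inputs['input_ids'], **inputs, max_new_tokens=1024, use_cache=False)
print(processor.decode(generated_ids[0][len(inputs['input_ids'][0]):]))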

Additional Resources

Performance: Baidu reports results on par with GPT-5-High and Gemini-2.5-Pro on chart analysis, document understanding, video reasoning, and STEM tasks.