
In the rapidly evolving landscape of multimodal AI, where language models are increasingly expected to see and understand images, videos, and more, InternVL3 emerges as a compelling new contender from the OpenGVLab team. Building on the InternVL series, InternVL3 introduces a fresh approach to training multimodal large language models (MLLMs) that promises not only top-tier performance but also a streamlined, open-source-friendly workflow.
Most existing MLLMs are built by retrofitting a vision encoder onto a language model that was pre-trained on text alone, then aligning the two in separate post-hoc stages. InternVL3 sidesteps this legacy by adopting what its creators call "native multimodal pre-training." Instead of shoehorning vision into an already-trained text model, InternVL3 learns language and visual representations jointly from the start, ingesting diverse multimodal data (image-text, video-text, and their interleavings) alongside large-scale pure-text corpora in a single, unified training stage. This holistic approach lets the model internalize the interplay between vision and language without cumbersome bridging modules or alignment hacks.
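To make the idea concrete, here is a minimal sketch of what a single-stage mixed corpus could look like. The class name, dataset sources, and sampling weights are hypothetical illustrations, not InternVL3's actual data recipe, which is described in its technical report.

```python
# Illustrative sketch: one unified sampling stream over image-text,
# video-text, and pure-text corpora, instead of staged alignment phases.
# Sources and weights are placeholders, not InternVL3's real mixture.
import random
from torch.utils.data import Dataset


class MixedMultimodalDataset(Dataset):
    """Draws samples from several corpora in a single pre-training stage."""

    def __init__(self, image_text, video_text, pure_text, weights=(0.5, 0.2, 0.3)):
        self.sources = [image_text, video_text, pure_text]
        self.weights = weights

    def __len__(self):
        return sum(len(s) for s in self.sources)

    def __getitem__(self, idx):
        # Weighted random source selection: vision and language data are
        # interleaved in the same stream rather than trained sequentially.
        source = random.choices(self.sources, weights=self.weights, k=1)[0]
        return source[idx % len(source)]
```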
The benefits are tangible. InternVL3-78B, the flagship 78-billion parameter variant, clocks a score of 72.2 on the MMMU benchmark—a new state-of-the-art among open-source MLLMs—and performs competitively with heavyweight proprietary rivals like ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, all while retaining robust pure-language capabilities.
Achieving high multimodal performance isn’t just about training data and model size. InternVL3 incorporates several technical innovations to push its capabilities further:
- Variable Visual Position Encoding (V2PE): Unlike fixed positional embeddings that can struggle with long or complex visual contexts, V2PE advances the position index by smaller, adaptable increments for visual tokens. This improves the model's handling of extended multimodal contexts such as multi-image inputs and videos, enabling more nuanced spatial and temporal reasoning; the sketch below illustrates the idea.
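To see what variable increments mean in practice, here is a toy sketch of position assignment in which text tokens advance the index by 1 while visual tokens advance it by a smaller fractional step. The 0.25 increment and the function name are illustrative choices, not InternVL3's actual configuration.

```python
# Toy sketch of the V2PE idea: visual tokens consume less of the
# positional range than text tokens, so long visual contexts fit
# into a shorter effective position span.
def assign_positions(token_types, visual_step=0.25):
    """token_types: sequence of 'text' or 'visual' flags in input order."""
    positions, pos = [], 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else visual_step
    return positions


# Example: two text tokens, four image tokens, then one more text token.
print(assign_positions(["text", "text", "visual", "visual", "visual", "visual", "text"]))
# -> [0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0]
```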
InternVL3 sticks with the proven ViT-MLP-LLM architecture of its predecessors, but refreshes the vision side with an incrementally pre-trained InternViT and pairs it with state-of-the-art language models such as InternLM 3 and Qwen 2.5. The model also supports a dynamic-resolution strategy, tiling input images into 448×448 pixel blocks, and natively accommodates multi-image and video inputs.
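As a rough illustration of the tiling step, the snippet below resizes an image to a grid of 448×448 blocks and crops the tiles. The official pipeline additionally selects an optimal aspect ratio and appends a global thumbnail tile, so treat this as a simplified sketch rather than the reference preprocessing code.

```python
# Simplified dynamic-resolution sketch: snap the image to a tile grid and
# cut it into 448x448 crops. Not the official InternVL3 preprocessing.
from PIL import Image

TILE = 448


def tile_image(path, max_tiles=12):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    # Use at least one tile per side, capping the total tile count.
    cols = max(1, round(w / TILE))
    rows = max(1, round(h / TILE))
    while cols * rows > max_tiles:
        cols, rows = max(1, cols - 1), max(1, rows - 1)
    img = img.resize((cols * TILE, rows * TILE))
    return [
        img.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
```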
True to open-science principles, the InternVL3 team is releasing both the training datasets and model weights publicly on Hugging Face, encouraging the community to build, experiment, and fine-tune. The model supports 16-bit precision as well as 8-bit quantization for more efficient deployment, with detailed instructions and code snippets provided for multi-GPU setups, batch inference, video processing, and even streaming outputs.
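Loading a checkpoint follows the usual Hugging Face pattern shown on the model cards. The snippet below is a minimal sketch: the checkpoint name and 8-bit flag mirror the published examples, while the actual chat and inference helpers live in the repository's trust_remote_code implementation, so consult the model card for the full pipeline.

```python
# Minimal loading sketch based on the InternVL3 model-card examples.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3-8B"  # swap in the 78B checkpoint for the flagship
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # 16-bit weights
    load_in_8bit=True,            # optional 8-bit quantization (needs bitsandbytes;
                                  # newer transformers may prefer BitsAndBytesConfig)
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```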
The ecosystem around InternVL3 is further enriched by compatibility with LMDeploy, a toolkit for compressing and serving large multimodal models through an interface familiar to anyone who has worked with LLM deployment pipelines. Whether you want to run multi-image conversations, batch prompts, or multi-turn dialogues with images and videos, InternVL3's tooling aims to make that a smooth experience.
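For serving, LMDeploy's vision-language pipeline can wrap an InternVL3 checkpoint in a few lines. The sketch below follows LMDeploy's documented VLM interface; the model name, session length, and image path are chosen for illustration.

```python
# Sketch of serving InternVL3 with LMDeploy's VLM pipeline.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = "OpenGVLab/InternVL3-8B"
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384))

image = load_image("path/to/image.jpg")          # illustrative local path
response = pipe(("describe this image", image))  # (prompt, image) tuple
print(response.text)
```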
While commercial multimodal giants like GPT-4V have captured the limelight, open-source efforts often lag behind in performance or usability. InternVL3’s native multimodal training paradigm and smart architectural tweaks deliver a serious leap forward, narrowing the gap and democratizing access to cutting-edge multimodal capabilities.
For researchers, developers, and enthusiasts, InternVL3 represents a rare combination of open access, strong out-of-the-box performance, and extensibility for specialized applications—be it industrial image analysis, GUI automation, or advanced multimodal reasoning.
If you’re curious to see what a state-of-the-art open-source multimodal model looks like in 2025, InternVL3 is worth a deep dive. The code, weights, and comprehensive documentation are all available at Hugging Face’s InternVL3 repository, making it easy to get hands-on and experiment with the future of vision-language AI.
Disclosure: InternVL3 uses pre-trained components licensed under Qwen License and is released under the MIT License, supporting broad adoption and academic use.