
In the ever-evolving landscape of large language models (LLMs), the race isn’t just about throwing more parameters at the problem—it’s about smarter architecture and efficiency at scale. Enter Nemotron-H, a fresh family of hybrid Mamba-Transformer models that promise to deliver state-of-the-art accuracy while slashing inference costs and boosting speed. Developed with a keen eye on practical deployment and scaling bottlenecks, Nemotron-H is a fascinating case study in the power of architectural innovation paired with clever training regimens.
The hybrid architecture replaces most self-attention layers with Mamba-2 layers, a selective state-space layer that uses constant computation and memory per generated token. Because only a small number of attention layers remain, inference cost and memory grow far more slowly with sequence length, enabling faster, more memory-efficient processing of long inputs.
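To make the layout concrete, here is a minimal sketch of how such a hybrid stack can interleave Mamba-2, self-attention, and FFN blocks. The block classes, pattern string, and dimensions below are illustrative stand-ins, not the released Nemotron-H implementation.

```python
# Illustrative sketch of a hybrid Mamba-2/Transformer stack (not the released code).
# The layer pattern and block internals are assumptions for clarity: most blocks
# are Mamba-2 style (constant per-token state), a few are self-attention, and
# FFN blocks are interleaved throughout.
import torch
import torch.nn as nn

class Mamba2Block(nn.Module):          # stand-in for a real Mamba-2 SSM block
    def __init__(self, d_model):
        super().__init__()
        self.mixer = nn.Linear(d_model, d_model)   # placeholder mixing op
    def forward(self, x):
        return x + self.mixer(x)

class AttentionBlock(nn.Module):       # standard multi-head self-attention
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

class FFNBlock(nn.Module):             # position-wise feed-forward block
    def __init__(self, d_model, mult=4):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, mult * d_model),
                                nn.GELU(),
                                nn.Linear(mult * d_model, d_model))
    def forward(self, x):
        return x + self.ff(x)

def build_hybrid_stack(pattern: str, d_model: int) -> nn.Sequential:
    """Build a stack from a pattern string: M = Mamba-2, A = attention, F = FFN."""
    blocks = {"M": Mamba2Block, "A": AttentionBlock, "F": FFNBlock}
    return nn.Sequential(*(blocks[c](d_model) for c in pattern))

# Mostly Mamba-2 blocks with only a few attention layers, mirroring the hybrid design.
model = build_hybrid_stack("MFMFMFAMFMFMFAMFMF", d_model=256)
x = torch.randn(1, 16, 256)            # (batch, sequence, hidden)
print(model(x).shape)                  # torch.Size([1, 16, 256])
```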
Nemotron-H models are pre-trained on 20 trillion tokens of curated, high-quality, and synthetically augmented data blending web crawl, mathematical content, code, and academic text. The synthetic data pipeline is particularly ingenious: low-quality crawl data is rephrased to reduce noise, while high-quality data is expanded through multiple prompt-driven transformations (e.g., diverse QA pairs, distillation, knowledge extraction).
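Conceptually, the augmentation step routes each document by quality: noisy crawl text gets rephrased, while high-quality text gets expanded into multiple views of the same content. The sketch below is a hypothetical illustration of that routing; the prompts and the `generate` helper are placeholders, not the actual Nemotron-H pipeline.

```python
# Hedged sketch of a quality-conditioned synthetic-data step. The prompts and
# the `generate` helper are hypothetical placeholders, not the real pipeline.
REPHRASE_PROMPT = (
    "Rewrite the following web text in clear, well-structured prose, "
    "preserving all factual content:\n\n{doc}"
)
QA_PROMPT = (
    "Write several diverse question-answer pairs that cover the key facts "
    "in the following passage:\n\n{doc}"
)

def generate(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned LLM."""
    raise NotImplementedError

def augment(doc: str, quality: str) -> list[str]:
    """Route a document through the appropriate synthetic transformation."""
    if quality == "low":
        # Rephrase noisy crawl data to strip boilerplate and reduce noise.
        return [generate(REPHRASE_PROMPT.format(doc=doc))]
    # Expand high-quality data with multiple prompt-driven views of the same content.
    return [generate(QA_PROMPT.format(doc=doc)),
            generate(REPHRASE_PROMPT.format(doc=doc))]
```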
Nemotron-H employs an FP8-based mixed-precision training recipe. Although the training loss is slightly higher than with BF16, downstream task performance matches or even exceeds that of BF16-trained models.
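For readers who want to see what FP8 training looks like in practice, here is an illustrative snippet using NVIDIA Transformer Engine's `fp8_autocast` with a delayed-scaling recipe. It captures the general idea (FP8 GEMMs with higher-precision loss and optimizer state) rather than the exact per-tensor recipe described in the report.

```python
# Illustration of FP8 mixed-precision training with NVIDIA Transformer Engine.
# This mirrors the general approach, not Nemotron-H's exact recipe; it requires
# FP8-capable hardware (e.g., H100) to actually run the GEMMs in FP8.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,   # E4M3 forward, E5M2 backward
                            amax_history_len=16,
                            amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True).cuda()
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda", requires_grad=True)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)                     # GEMM runs in FP8 on supported hardware
loss = out.float().pow(2).mean()       # loss and optimizer state stay in high precision
loss.backward()
optimizer.step()
```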
Nemotron-H’s hybrid design translates to impressive real-world gains. On NVIDIA H100 GPUs with long input sequences (65,536 tokens) and 1,024 output tokens, Nemotron-H-56B delivers up to 3× higher inference throughput than similarly sized Transformer rivals such as Qwen-2.5-72B and Llama-3.1-70B.
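For context, throughput comparisons of this kind are usually reported as output tokens per second at fixed input and output lengths. The helper below sketches such a measurement, assuming a model and tokenizer with the Hugging Face `generate()` interface; it is not the benchmarking harness used in the report.

```python
# Illustrative throughput measurement: fix input/output lengths, time end-to-end
# generation, and report output tokens per second. `model` and `tokenizer` are
# assumed to follow the Hugging Face generate() interface.
import time
import torch

def output_tokens_per_second(model, tokenizer, prompt_len=65_536, new_tokens=1_024):
    input_ids = torch.randint(0, tokenizer.vocab_size, (1, prompt_len),
                              device=model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return new_tokens / (time.perf_counter() - start)
```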
Large models often need tailoring to fit deployment constraints, especially on consumer hardware with limited memory. For Nemotron-H, the team developed MiniPuzzle, a compression framework that combines lightweight pruning, neural architecture search (NAS), and distillation. Applied to Nemotron-H-56B, it yields a 47B-parameter variant that retains nearly all of the larger model’s accuracy while running inference faster.
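The following toy sketch conveys the general prune-then-search idea: score each layer's importance, then search for the subset of layers that maximizes retained importance under a parameter budget. The scoring, search space, and budget here are simplified stand-ins, not the actual MiniPuzzle algorithm, which also distills the pruned candidate from the parent model.

```python
# Schematic of the prune-plus-search idea behind a MiniPuzzle-style pipeline.
# Importance scores, candidate space, and budget are simplified stand-ins.
from itertools import combinations

def search_pruned_architecture(layer_importance: list[float],
                               layer_params: list[int],
                               param_budget: int,
                               keep_at_least: int):
    """Pick which layers to keep so total params fit the budget and
    summed importance is maximized (brute force, for illustration only)."""
    n = len(layer_importance)
    best, best_score = None, float("-inf")
    for k in range(keep_at_least, n + 1):
        for keep in combinations(range(n), k):
            params = sum(layer_params[i] for i in keep)
            if params > param_budget:
                continue
            score = sum(layer_importance[i] for i in keep)
            if score > best_score:
                best, best_score = keep, score
    return best

# Toy example: 6 layers, keep at least 4, budget of 40 "units" of parameters.
kept = search_pruned_architecture(
    layer_importance=[0.9, 0.2, 0.8, 0.4, 0.7, 0.3],
    layer_params=[10, 10, 10, 10, 10, 10],
    param_budget=40,
    keep_at_least=4,
)
print(kept)   # indices of the retained layers: (0, 2, 3, 4)
```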
Nemotron-H also extends into vision-language modeling, with 8B and 56B variants built on instruction-tuned or base Nemotron-H backbones. These models incorporate the InternViT-300M-V2.5 vision encoder and a two-layer FFN projector to map image tokens into the language model’s embedding space.
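The projector itself is conceptually simple: a two-layer MLP that maps each vision-encoder token into the language model's hidden dimension. Here is a minimal sketch; the hidden sizes are illustrative assumptions, not the published VLM configuration.

```python
# Minimal sketch of a two-layer FFN projector that maps vision-encoder tokens
# into the language model's embedding space. The hidden sizes below are
# illustrative assumptions, not the Nemotron-H VLM configuration.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_patches, vit_dim) from the vision encoder
        return self.proj(image_tokens)

projector = VisionProjector()
image_tokens = torch.randn(1, 256, 1024)
print(projector(image_tokens).shape)   # torch.Size([1, 256, 4096])
```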
Nemotron-H models don’t just boast efficiency: across a broad suite of knowledge, reasoning, math, and code benchmarks, they match or exceed similarly sized open Transformer models.
Nemotron-H’s blend of architectural innovation, massive high-quality training data, and smart training recipes exemplifies a compelling direction for future LLM development. By marrying the efficiency of Mamba layers with the proven power of Transformers, and backing it all with advanced compression and precision techniques, Nemotron-H stands as a testament to the fact that bigger isn’t always slower—sometimes, smarter is faster.
You can expect Nemotron-H models to become a valuable tool in the open-source ecosystem, with planned releases supporting Hugging Face, NeMo, and Megatron-LM. For those chasing high-performance LLMs that don’t cripple your GPU or wallet, Nemotron-H is worth a close look.
For the full technical deep dive, see the original Nemotron-H report on arXiv.