In the ever-expanding landscape of large language models (LLMs), the latest entrant from SkyworkAI—the Skywork-OR1 (Open Reasoner 1) series—is turning heads. Launched on April 13, 2025, this family of open-source models is not just another incremental improvement on existing architectures. Instead, it exemplifies a new breed of parameter-efficient reasoning engines, delivering performance that rivals behemoths with 20 times more parameters.

The OR1 series comprises three models: the Skywork-OR1-Math-7B, Skywork-OR1-7B-Preview, and Skywork-OR1-32B-Preview. All three are fine-tuned versions of DeepSeek’s distilled Qwen models—specifically, the 7B variants build on DeepSeek-R1-Distill-Qwen-7B, and the 32B on DeepSeek-R1-Distill-Qwen-32B. But the real magic lies in how SkyworkAI trained these models.

Using a large-scale rule-based reinforcement learning regimen, the team targeted mathematical and coding reasoning capabilities with surgical precision. The training dataset is impressively curated, featuring over 110,000 verifiable and challenging math problems alongside 14,000 coding questions drawn exclusively from open-source pools. Each problem undergoes a model-aware difficulty estimation, allowing the training process to focus dynamically on examples that are neither too easy nor insurmountably hard—avoiding the common pitfalls of “all correct” or “all incorrect” feedback that can stifle learning.
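SkyworkAI hasn't yet published the details of its difficulty estimator, but the core idea is easy to sketch: sample the current model several times on each problem and drop anything it always or never solves, since those cases produce no useful reward signal. A minimal illustration (the function names and thresholds are mine, not SkyworkAI's):

```python
def estimate_difficulty(pass_results):
    """Model-aware difficulty proxy: the fraction of sampled attempts
    the current model gets right on this problem."""
    return sum(pass_results) / len(pass_results)

def keep_problem(pass_results):
    """Keep only problems the model sometimes solves and sometimes fails,
    so the RL reward is informative (hypothetical filter, not Skywork's)."""
    rate = estimate_difficulty(pass_results)
    return 0.0 < rate < 1.0
```

In practice the sampling budget, thresholds, and re-estimation schedule would all matter; this just shows why "all correct" and "all incorrect" problems get pruned.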

The training pipeline uses a custom flavor of Group Relative Policy Optimization (GRPO), augmented with multi-stage training phases, adaptive entropy control to balance exploration and convergence, and a custom fork of the VERL framework—all aimed at squeezing reasoning prowess out of relatively modest parameter counts.
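The actual training code lives in SkyworkAI's VERL fork, but the defining step of GRPO is simple to illustrate: rather than learning a separate value function as a baseline, each sampled response is scored relative to the mean and spread of its own group of samples. A rough sketch of that normalization (simplified, not the production implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each response's reward by the
    group's mean and population standard deviation (GRPO-style baseline)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards
    return [(r - mean) / std for r in rewards]
```

Responses that beat their siblings get positive advantage and are reinforced; those below the group mean are pushed down, with no critic network required.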

When it comes to measuring reasoning ability, SkyworkAI eschews the common Pass@1 metric in favor of Avg@K—evaluating average performance over multiple attempts (32 for math tests, 4 for coding benchmarks). This gives a more stable, trustworthy read on how consistent and reliable the model is at cracking tough problems.
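Avg@K is straightforward to compute: score each of the K attempts on a problem, average within the problem, then average across problems. A small sketch (my own helper, not Skywork's evaluation script):

```python
def avg_at_k(attempt_scores):
    """Avg@K over a benchmark.

    attempt_scores: one list per problem, each holding K per-attempt
    scores (e.g. 1 for a correct solution, 0 otherwise).
    """
    per_problem = [sum(scores) / len(scores) for scores in attempt_scores]
    return sum(per_problem) / len(per_problem)
```

Unlike Pass@1 on a single sample, averaging over 32 (math) or 4 (code) attempts smooths out sampling noise, so a lucky or unlucky generation doesn't swing the headline number.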

Here’s the showstopper: the Skywork-OR1-32B-Preview, with its 32.8 billion parameters, scores nearly identically to the gargantuan 671-billion-parameter DeepSeek-R1 model on multiple benchmarks:

| Model | AIME24 (Avg@32) | AIME25 (Avg@32) | LiveCodeBench (Avg@4) |
|---|---|---|---|
| DeepSeek-R1 (671B) | 79.8 | 70.0 | 65.9 |
| Skywork-OR1-32B-Preview | 79.7 | 69.0 | 63.9 |

That’s right—Skywork achieves roughly the same math and code reasoning performance as a model over 20 times its size. This is a striking example of parameter efficiency, which can translate to large savings in computational cost, inference latency, and energy consumption.

For the smaller 7B models, the specialization pays off too. The Skywork-OR1-Math-7B, explicitly fine-tuned for math, outperforms its DeepSeek base by a wide margin (69.8 vs. 55.5 on AIME24), showing that targeted reinforcement training on carefully selected data can punch well above its weight class.

The OR1 models leverage BF16 tensor precision for a balance of numerical accuracy and performance, and are distributed in the safetensors format. Their underlying architecture is Qwen2, a solid foundation known for good performance on reasoning and code tasks.
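The safetensors format itself is easy to inspect: a file begins with an 8-byte little-endian length, followed by a JSON header recording each tensor's dtype (such as BF16), shape, and byte offsets into the payload. A minimal reader, plus a toy one-tensor file for demonstration (the file contents here are synthetic, not an actual OR1 checkpoint):

```python
import json
import struct

def read_safetensors_header(path):
    """Parse a .safetensors header: 8-byte little-endian u64 length,
    then that many bytes of JSON metadata."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

def write_demo_safetensors(path):
    """Write a toy file with one BF16 tensor of two elements (4 bytes)."""
    header = json.dumps(
        {"w": {"dtype": "BF16", "shape": [2], "data_offsets": [0, 4]}}
    ).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))
        f.write(header)
        f.write(b"\x00" * 4)  # zeroed tensor payload
```

Because the header is plain JSON, you can check a checkpoint's dtypes and shapes without loading any weights, which is part of why the format has become the default for model distribution.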

The data pipeline is sophisticated, with offline and online filtering to prune suboptimal examples, rejection sampling to control training distribution, and multi-phase curriculum learning to build capabilities progressively. Such rigor in dataset design and training strategy is often what separates baseline fine-tuning from state-of-the-art model crafting.
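Rejection sampling in this context means keeping each candidate training example with some acceptance probability, so the training distribution is shaped deliberately rather than taken wholesale. A toy version (in practice the acceptance rule would come from the verifier and difficulty estimates; this interface is my own):

```python
import random

def rejection_sample(candidates, accept_prob):
    """Keep each candidate with probability accept_prob(candidate),
    reshaping the training distribution (illustrative sketch)."""
    return [c for c in candidates if random.random() < accept_prob(c)]
```

With an acceptance function that favors mid-difficulty problems, this acts as a stochastic counterpart to the hard filtering described above.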

Moreover, SkyworkAI open-sources not only the model weights but also the training data (Skywork-OR1-RL-Data), evaluation datasets, and code, fostering transparency and community engagement. Evaluation scripts are provided, with Docker and Conda environments detailed for reproducibility—a welcome gesture in an era where reproducibility can be elusive.

Skywork-OR1’s success is emblematic of a broader trend: rather than chasing ever-larger models, smart training techniques and curated data can bridge much of the performance gap. It highlights that open innovation and clever engineering can democratize access to powerful reasoning models without requiring the massive compute budgets of tech giants.

There’s also a strong community angle. With quantizations available for efficient local deployment (including llama.cpp-friendly GGUF formats), hobbyists and researchers with modest hardware can experiment with models that were once the exclusive domain of well-funded labs.

It’s important to note that the Skywork-OR1 models are described as “Preview” releases, with final versions and comprehensive technical reports due shortly. While the results are promising, the models build on distilled and fine-tuned foundations rather than being trained from scratch, raising the question of how much the underlying base contributes versus the reinforcement learning fine-tuning.

Also, as with many models trained on publicly available datasets, there’s the usual caveat about potential data leakage—especially since AIME problems and coding challenges circulate widely in the community. Yet, the team’s emphasis on verifiability and rigorous filtering suggests a conscientious approach.

SkyworkAI’s OR1 series is a strong demonstration that big reasoning chops don’t necessarily require big model sizes. By blending distilled architectures, large-scale rule-based reinforcement learning, and an expertly curated dataset, they’ve crafted models that deliver near state-of-the-art performance with only a fraction of the parameters that competitors wield.

For developers, researchers, and enthusiasts craving powerful yet efficient math and code reasoning models, the Skywork-OR1 series is definitely worth a spin—especially as the final releases and deeper documentation roll out in the coming weeks.

For those eager to dive in, the models and training data are available on SkyworkAI’s GitHub and Hugging Face repositories, complete with evaluation scripts and Docker containers to get you started.
