Let me say the number plainly: 55.5. That's what DeepSeek-R1-Distill-Qwen-7B scores (pass@1) on AIME 2024, a benchmark built from the American Invitational Mathematics Examination's extremely hard competition problems. To put that in context, GPT-4o sits at 9.3. Claude 3.5 Sonnet, widely considered the strongest general-purpose model of 2024, sits at 16.0. DeepSeek-R1-Distill-Qwen-7B, a model that fits in about 8GB of VRAM and runs natively on a consumer GPU, scores more than three times higher than the best non-reasoning models on one of the hardest maths benchmarks around.

This is not a toy result. DeepSeek-R1-Distill-Qwen-14B hits 69.7, which puts it ahead of OpenAI's o1-mini (63.6) on AIME: a model that cost $3 per million input tokens to run via API, now matched by a 14-billion-parameter checkpoint you can serve from your own machine.

The Idea That Shouldn't Have Worked

Before we get into benchmarks, it's worth pausing on what DeepSeek actually did, because it's genuinely surprising. The team trained DeepSeek-R1-Zero, a first pass at a reasoning model, using pure reinforcement learning on the base model, with no supervised fine-tuning step at all. No human-written reasoning chains. No curated "here's how you solve this" examples. Just: here's a problem, here's a reward signal, figure it out.

What emerged was a model that developed chain-of-thought reasoning, self-verification, and long reflection sequences entirely on its own. The paper shows examples of the model pausing mid-problem, checking its own work, backtracking, and restarting. These behaviours weren't taught; they were discovered, because the reward structure made them advantageous.

This is the part that matters philosophically. It suggests that reasoning isn't something you can only inject by showing the model worked examples. Reasoning can be evoked, if you get the training dynamics right. The model figures out that thinking longer, checking its own logic, and exploring alternative approaches leads to higher rewards. That's a meaningful result for anyone building training pipelines.
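To make the reward side concrete, here is a minimal sketch of a rule-based reward in the spirit of what the paper describes: an accuracy component checked against a verifiable final answer, plus a format component enforcing the thinking tags. The weights and regexes below are my own illustrative assumptions, not DeepSeek's actual values.

```python
import re

def reasoning_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward with the two components the R1 paper
    describes: format and accuracy. Weights are illustrative only."""
    score = 0.0

    # Format reward: reasoning must be wrapped in <think>...</think> tags.
    if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        score += 0.5

    # Accuracy reward: extract the final \boxed{...} answer and compare it
    # with ground truth. Maths answers are mechanically checkable, which
    # is exactly what makes pure RL viable here -- no reward model needed.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if matches and matches[-1].strip() == gold_answer.strip():
        score += 1.0

    return score
```

A signal like this is all the supervision R1-Zero got: no worked examples, just a scalar telling the policy whether its answer checked out and whether it kept its thinking where the trainer could see it.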

DeepSeek-R1 itself refined this with cold-start data (a small amount of human-curated reasoning to give the RL process better initialisation), but the Zero result is the proof of concept that reasoning can be discovered, not just transferred.

What Distillation Actually Does

Here's the pipeline in plain English: DeepSeek took the full 671-billion-parameter DeepSeek-R1 (a Mixture-of-Experts model with 37B activated parameters at inference time), let it generate roughly 800,000 training samples, the bulk of them reasoning traces across maths, code, and general-knowledge problems, and then used those traces to fine-tune much smaller dense models: Qwen2.5 and Llama3 variants at 1.5B, 7B, 8B, 14B, 32B, and 70B scales.
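Here's a minimal sketch of the data-generation half of that pipeline, assuming the HuggingFace transformers interface. In reality the teacher is the full R1 served on a cluster; a distilled checkpoint stands in below so the snippet runs on a single GPU, and the prompt is a placeholder.

```python
# Distillation data generation, sketched: sample a reasoning trace from a
# teacher model, keep the (prompt, trace) pair as a supervised example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # stand-in for the 671B teacher
tok = AutoTokenizer.from_pretrained(TEACHER)
model = AutoModelForCausalLM.from_pretrained(
    TEACHER, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "How many positive integers below 1000 are divisible by 7 but not by 11?"
inputs = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=4096, do_sample=True,
                     temperature=0.6, top_p=0.95)
trace = tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)

# The student never sees a reward signal. Fine-tuning on these pairs is
# plain supervised learning, which is what makes distillation so cheap.
sft_example = {"prompt": prompt, "completion": trace}
```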

The critical finding is that reasoning ability is compressible. You can take the pattern of how a 671B model thinks through a hard problem, distill it into the weights of a 7B model, and retain a substantial fraction of that capability. This is not obvious: you might expect that a small model would learn to mimic the surface behaviour of the large model without the underlying reasoning competence. The benchmarks suggest that isn't what's happening. The smaller models are genuinely better at reasoning tasks than models trained the traditional way at the same scale.

The distillation results table from the paper tells the story clearly:

DeepSeek-R1-Distill-Qwen-7B: AIME 2024 pass@1 of 55.5, MATH-500 at 92.8, Codeforces rating of 1189. This is a 7B model that out-rates GPT-4o on hard maths.

DeepSeek-R1-Distill-Qwen-14B: AIME 2024 pass@1 of 69.7, Codeforces rating of 1481. This model surpasses OpenAI o1-mini on AIME (the paper reports o1-mini at 63.6), and while OpenAI has never disclosed o1-mini's parameter count, this one is a 14B checkpoint whose weights you can actually download.

DeepSeek-R1-Distill-Qwen-32B: AIME at 72.6, Codeforces at 1691, meaningfully outperforming o1-mini on most benchmarks (Codeforces is the exception, where o1-mini's 1820 still leads) and matching what most people considered the frontier of reasoning just eighteen months ago.

Running It Locally: What You Actually Need

The good news: these models are designed to run locally. DeepSeek explicitly engineered the distilled checkpoints to work with standard inference stacks: vLLM, SGLang, and by extension, Ollama and anything using the HuggingFace transformers interface.
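On vLLM, a minimal invocation looks something like the sketch below. The sampling values follow DeepSeek's published recommendations (more on those later), the generous token budget matters because the chain-of-thought runs long, and the prompt is a placeholder.

```python
# Minimal vLLM sketch for the 7B distill. Unquantised bf16 weights want
# roughly 16GB of VRAM; see the quantisation numbers below for less.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)

outputs = llm.generate(
    ["How many primes lie strictly between 100 and 150? "
     "Please reason step by step, and put your final answer within \\boxed{}."],
    params,
)
print(outputs[0].outputs[0].text)
```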

For the 7B at Q4_K_M quantisation (the format most people use), you're looking at roughly 5-6GB of VRAM. The 14B at the same quantisation sits around 10-11GB. If you've got a 3060 12GB, you can run the 14B with comfortable headroom. The 7B runs on considerably less, which is useful if you're on a laptop or a card with less memory.
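For the quantised builds, Ollama is the path of least resistance. Here's a sketch using its Python client (pip install ollama); the deepseek-r1:14b tag assumes Ollama's published distill builds, so check what you've actually pulled.

```python
# Querying a locally served quantised build through Ollama's Python client.
# Assumes `ollama pull deepseek-r1:14b` has already fetched the weights.
import ollama

resp = ollama.chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Is 4097 prime? Reason it out."}],
    options={"temperature": 0.6, "top_p": 0.95},
)
print(resp["message"]["content"])
```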

The catch, and there is a catch, is that reasoning models output far more tokens per query than standard models. When you ask a normal 7B to write a response, it might generate 300-500 tokens. When you ask DeepSeek-R1-Distill-7B to solve a hard problem, it can and will output thousands of tokens of chain-of-thought before arriving at an answer. This means raw generation speed (tokens/second) is a poor proxy for responsiveness, let alone quality. What matters is that the answer is right, and that the model doesn't give up halfway through a complex derivation.

In practice, generation speed is acceptable for interactive use. I've had the 7B running at 15-25 tokens per second on a 3060, which is perfectly usable for deliberate, thoughtful work; less so for high-throughput batch processing.

The Configuration That Makes the Difference

DeepSeek's own usage recommendations are worth reading carefully, because the model behaves differently from standard instruction-tuned models in ways that matter.

The temperature recommendation is 0.5–0.7, with 0.6 as the sweet spot. Below that, you risk the model entering repetitive loops, a known failure mode of reasoning models that haven't been as aggressively post-processed as commercial products. Above 0.7, you start losing the coherence that makes long chain-of-thought useful.

The more important point is the prompt structure. DeepSeek explicitly recommends forcing the model to begin every response with "<think>\n". This isn't cosmetic: the model sometimes bypasses its own thinking pattern for certain query types, emitting an empty "<think>\n\n</think>" block, if it isn't explicitly nudged into reasoning, meaning you lose the capability you paid for. It's a small configuration detail that makes a large functional difference.
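One way to enforce that with transformers is to render the chat template to a string and append the opening tag yourself, so decoding starts inside the reasoning block. Newer revisions of the official checkpoints bake this into the chat template already, so inspect yours before duplicating the tag; this is a sketch, not DeepSeek's reference code.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

# Render the chat template to text, then append the opening tag so the
# model can't skip straight to an answer.
prompt_text = tok.apply_chat_template(
    [{"role": "user", "content": "Is 1001 divisible by 13?"}],
    tokenize=False, add_generation_prompt=True,
) + "<think>\n"

# Feed prompt_text to whatever inference stack you're using, sampling at
# temperature 0.6 per the recommendation above.
```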

For maths problems specifically, including "Please reason step by step, and put your final answer within \boxed{}" in your prompt is the recommended approach. This aligns the model's output format with the evaluation benchmarks, and apparently also helps the model structure its own thinking more cleanly.
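And here's the corresponding prompt pattern for maths, plus a simple way to pull the final answer back out; the extraction helper is my own convenience, not part of DeepSeek's recommendation.

```python
import re
from typing import Optional

question = "Find the remainder when 7^100 is divided by 13."
prompt = (
    f"{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{}."
)

def extract_boxed(completion: str) -> Optional[str]:
    """Return the last \\boxed{...} value in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1] if matches else None
```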

The Broader Shift This Represents

Reasoning models have been treated as a product category: something OpenAI owns and sells via API, priced at a premium. DeepSeek-R1 repositions the whole category. It says: reasoning isn't a proprietary capability that requires frontier-scale infrastructure. It's a training methodology, and once you understand the methodology, you can replicate it on dense models small enough to run at home.

This has immediate practical consequences. If you're building any system that needs reliable multi-step logic (code generation, mathematical reasoning, structured decision-making, agentic tool use), you now have access to models that perform at a level that was API-only territory eighteen months ago, running on hardware you own. The latency is higher than a fast inference call, but the cost structure is fundamentally different.

It also clarifies what matters going forward. Distillation is only as good as the teacher model, which means the full DeepSeek-R1 (671B MoE) remains important as a source of reasoning traces. But the practical access point, what most people will actually use, is the distilled 7B and 14B. And the hardware gap between them is narrowing fast as quantisation improves and inference stacks get more efficient.

What we're seeing is the commoditisation of reasoning. Not "AI" in the broad, vague sense, but specific, verifiable, multi-step logical competence, available to anyone with a GPU and an evening to set it up. That's worth writing about, because it's genuinely different from where we were even a year ago.