DeepSeek V4 Paper Reveals the Architecture Behind Million-Token Context at 10% the Cost

DeepSeek published the full technical report for its V4 model family on June 19, detailing the architectural innovations that make million-token context windows economically viable in production. The paper, released as an arXiv preprint, explains how the Chinese AI lab cut inference FLOPs to 27% and KV cache memory to 10% of its predecessor's at a one-million-token context length, while matching or exceeding its benchmark performance across coding, math, and reasoning tasks.

The V4 family, which launched as a preview on April 24, consists of two Mixture-of-Experts models. DeepSeek V4 Pro packs 1.6 trillion total parameters with 49 billion activated per token, while V4 Flash uses 284 billion total parameters with 13 billion activated. Both support a full 1-million-token context window and are licensed under MIT, making them among the most permissively licensed frontier-scale models available.
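To make the Mixture-of-Experts numbers concrete, the short sketch below computes what share of each model's parameters is activated per token. It uses only the parameter counts quoted above; the dictionary layout and function name are illustrative choices, not anything from the paper.

```python
# Parameter counts in billions, as quoted in the article.
MODELS = {
    "V4 Pro": {"total_b": 1600, "active_b": 49},
    "V4 Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a Mixture-of-Experts model's parameters used for each token."""
    return active_b / total_b


for name, params in MODELS.items():
    share = active_fraction(**params)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Both models touch only a few percent of their weights on any given token, which is why the activated count, rather than the total, drives per-token compute.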

The headline architectural change is a hybrid attention mechanism that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA). Standard transformer attention scales quadratically with sequence length, making million-token contexts prohibitively expensive for all but the wealthiest operators. DeepSeek’s approach replaces dense attention over the full sequence with a combination of sparse token selection and aggressive compression at the key-value level.
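The general idea of pairing key-value compression with sparse selection can be sketched in a few lines. The example below is a generic illustration only and is not DeepSeek's CSA or HCA: mean pooling stands in for whatever compression the paper uses, a plain top-k over compressed scores stands in for its selection rule, and the function names, `block`, and `top_k` are all hypothetical.

```python
import math


def mean_pool(vectors, block):
    """Compress a sequence by averaging each consecutive block of vectors."""
    pooled = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return pooled


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_compressed_attention(query, keys, values, block=4, top_k=2):
    """Attend over only the top_k best-scoring compressed key/value entries."""
    ck, cv = mean_pool(keys, block), mean_pool(values, block)
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in ck]
    # Sparse selection: keep the indices of the top_k compressed entries.
    chosen = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    # Softmax restricted to the selected entries.
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    dim = len(cv[0])
    out = [sum(w * cv[i][d] for w, i in zip(weights, chosen)) / total for d in range(dim)]
    return out, sorted(chosen)


# Toy input: 16 tokens, only tokens 8..11 have keys aligned with the query.
keys = [[1.0, 0.0] if 8 <= t < 12 else [0.0, 0.0] for t in range(16)]
values = [[float(t // 4)] * 2 for t in range(16)]
output, selected = sparse_compressed_attention([10.0, 0.0], keys, values, block=4, top_k=1)
print(selected, output)
```

In this toy setup the query scores 4 compressed entries instead of 16 tokens and attends to just one of them, which is where the savings in both compute and cache come from.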

At one million tokens, V4 Pro requires only 27% of the inference FLOPs and 10% of the KV cache that DeepSeek V3.2 needed at the same context length, which works out to roughly 3.7 times less compute and a tenth of the cache memory per request. A gain of that size is what separates a long-context feature that lives in a press release from one that works in production at scale.

The paper also introduces Manifold-Constrained Hyper-Connections (mHC), a drop-in replacement for conventional residual connections that improves signal propagation stability across the model’s 61 transformer layers. DeepSeek adopted the Muon optimizer instead of the industry-standard AdamW, reporting faster convergence and greater training stability during the 32-trillion-token pre-training run.

Benchmark positioning and independent evaluation

DeepSeek reports V4 Pro-Max, the maximum reasoning effort mode, at 80.6% on SWE-bench Verified, 93.5% on LiveCodeBench, and a Codeforces rating of 3,206 (placing 23rd among human competitors globally). On GPQA Diamond, a graduate-level science reasoning benchmark, it scores 90.1%.

Independent verification from the Center for AI Standards and Innovation (CAISI) at the U.S. National Institute of Standards and Technology, published May 1, tells a more nuanced story. CAISI evaluated V4 Pro across five domains using both public and non-public benchmarks, including its own PortBench for software engineering and a semi-private ARC-AGI-2 dataset for abstract reasoning. The agency found DeepSeek V4 to be the most capable Chinese AI model evaluated to date, but estimated its aggregate capability lags behind leading U.S. models by roughly eight months.

In CAISI’s evaluations, DeepSeek V4 Pro scored 74% on SWE-bench Verified (below its self-reported 80.6%), 44% on PortBench, and 46% on the semi-private ARC-AGI-2 set. For comparison, OpenAI’s GPT-5.5 scored 81%, 78%, and 79%, respectively, on the same evaluations. On mathematics benchmarks, DeepSeek held its own, scoring 97% on OTIS-AIME-2025 and tying GPT-5.5 at 96% on PUMaC 2024.

The gap between self-reported and independent scores is not unusual. DeepSeek’s reported SWE-bench numbers use specific scaffolding and token budgets that may differ from CAISI’s standardized evaluation framework. The company’s own tests use established public benchmarks with well-documented methodologies. CAISI’s evaluations include non-public tasks designed to resist training-data contamination.
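The size of these gaps is easier to compare side by side. The snippet below is a small worked calculation using only the percentages quoted above; the variable and function names are illustrative.

```python
# (DeepSeek V4 Pro, GPT-5.5) scores in percent, from the CAISI evaluation.
CAISI_SCORES = {
    "SWE-bench Verified": (74, 81),
    "PortBench": (44, 78),
    "ARC-AGI-2 semi-private": (46, 79),
}


def gaps_to_gpt(scores):
    """Percentage-point lead of GPT-5.5 over V4 Pro on each benchmark."""
    return {name: gpt - deepseek for name, (deepseek, gpt) in scores.items()}


# Gap between DeepSeek's self-reported SWE-bench Verified score and CAISI's.
self_report_gap = round(80.6 - 74, 1)

for name, gap in gaps_to_gpt(CAISI_SCORES).items():
    print(f"{name}: GPT-5.5 ahead by {gap} points")
print(f"Self-reported vs CAISI on SWE-bench Verified: {self_report_gap} points")
```

The spread matters: the shortfall on SWE-bench Verified is single digits, while the gaps on PortBench and ARC-AGI-2 are several times larger.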

Pricing that reshaped the market

The most disruptive aspect of V4 may not be its architecture but its economics. On May 22, DeepSeek made a 75% discount on V4 Pro permanent. Current API pricing for V4 Pro stands at $0.435 per million input tokens and $0.87 per million output tokens, with automatically applied context caching dropping cache-hit input to $0.003625 per million tokens. V4 Flash costs $0.14 per million input tokens and $0.28 per million output tokens, with cache-hit input at $0.0028 per million.
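To see what caching does to a long-context bill, here is a rough cost estimator built from the prices above. It assumes the cache-hit prices are quoted per million tokens like the others, and the model labels `v4-pro` and `v4-flash` are placeholders for this sketch rather than confirmed API model IDs.

```python
# USD per million tokens, from the article. Model labels are hypothetical.
PRICES = {
    "v4-pro": {"input": 0.435, "cached_input": 0.003625, "output": 0.87},
    "v4-flash": {"input": 0.14, "cached_input": 0.0028, "output": 0.28},
}


def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimated USD cost of one request; cached_tokens is the cache-hit part of the input."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    price = PRICES[model]
    uncached = input_tokens - cached_tokens
    total = (
        uncached * price["input"]
        + cached_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


# A full 1M-token prompt with 10K output tokens, cold versus 90% cache hits.
cold = request_cost("v4-pro", 1_000_000, 10_000)
warm = request_cost("v4-pro", 1_000_000, 10_000, cached_tokens=900_000)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

Under these assumptions, a workload that repeatedly queries the same large document pays mostly for the uncached remainder and the output tokens.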

Compared to GPT-5.5 at $5 per million input tokens and $30 per million output, DeepSeek V4 Pro is roughly 11 times cheaper on input and 34 times cheaper on output. Against Claude Opus 4.7 at $5 input and $25 output, V4 Flash is roughly 35 times cheaper on input and 89 times cheaper on output. V4 Pro against Opus 4.7 works out to roughly 11 times cheaper on input and 29 times cheaper on output.
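The multiples above follow directly from the list prices. The check below recomputes them from the per-million-token prices quoted in this section; the helper name is illustrative.

```python
# (input, output) prices in USD per million tokens, from the article.
PRICE_LIST = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "V4 Pro": (0.435, 0.87),
    "V4 Flash": (0.14, 0.28),
}


def times_cheaper(expensive, cheap):
    """Return (input ratio, output ratio) of the pricier model over the cheaper one."""
    exp_in, exp_out = PRICE_LIST[expensive]
    cheap_in, cheap_out = PRICE_LIST[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


for pricey, cheap in [
    ("GPT-5.5", "V4 Pro"),
    ("Claude Opus 4.7", "V4 Flash"),
    ("Claude Opus 4.7", "V4 Pro"),
]:
    ratio_in, ratio_out = times_cheaper(pricey, cheap)
    print(f"{cheap} vs {pricey}: {ratio_in:.1f}x input, {ratio_out:.1f}x output")
```

The computed ratios land within a point of the rounded figures quoted in the text.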

CAISI’s cost analysis, using pre-discount pricing, found DeepSeek V4 cost less than GPT-5.4 mini on five of seven benchmarks, ranging from 53% less to 41% more per correctly solved task.

The pricing gap has reshaped the economics of frontier-model access, particularly for high-volume users who can route simpler tasks to V4 Flash at roughly one-thirty-fifth the cost of the closest closed-source competitors.

The two-model strategy

DeepSeek’s decision to launch two models simultaneously, rather than a single flagship, reflects a maturing product strategy. V4 Pro targets the performance ceiling for complex reasoning, multi-step analysis, and agentic coding workflows where higher latency is acceptable. V4 Flash optimizes for latency-sensitive applications, high-volume chat, and cost-sensitive deployments where the smaller activated parameter count (13B vs 49B) delivers faster responses.

Both models share the same hybrid attention architecture, the same 1-million-token context window, and the same MIT license. The difference in total parameter count (1.6T vs 284B) means V4 Flash can run on far more modest hardware. At FP8 quantization, V4 Flash fits on a single H100 or comfortably on 4 x H200 GPUs. V4 Pro requires a minimum of 8 x H100s at FP8, or roughly $50,000 in GPU infrastructure for self-hosted deployment at full precision.
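A back-of-the-envelope check on hardware sizing is to estimate the memory needed for the weights alone. The sketch below assumes one byte per parameter at FP8 and per-GPU memory of 80 GB for an H100 and 141 GB for an H200; those capacities are assumptions brought in for illustration, and the estimate ignores KV cache, activations, and runtime overhead, so it is a lower bound.

```python
import math

# Assumed per-GPU memory in GB; not taken from the article.
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def weight_memory_gb(total_params_b, bytes_per_param=1.0):
    """Weights-only footprint: billions of parameters times bytes each gives GB."""
    return total_params_b * bytes_per_param


def min_gpus_for_weights(total_params_b, gpu, bytes_per_param=1.0):
    """Smallest GPU count whose combined memory holds the weights alone."""
    needed = weight_memory_gb(total_params_b, bytes_per_param)
    return math.ceil(needed / GPU_MEMORY_GB[gpu])


for name, params_b in [("V4 Flash", 284), ("V4 Pro", 1600)]:
    for gpu in GPU_MEMORY_GB:
        count = min_gpus_for_weights(params_b, gpu)
        print(f"{name}: about {weight_memory_gb(params_b):.0f} GB of weights, at least {count} x {gpu}")
```

By this estimate, fitting either model on the H100 counts quoted above would take lower-precision weights or offloading beyond plain FP8, so those figures are worth confirming against the deployment guidance before sizing a cluster.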

The practical implication is that developers can use V4 Flash for everyday coding assistance and document processing, switching to V4 Pro only for the hardest reasoning problems, while paying a fraction of what equivalent closed-source models charge for either tier.
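A two-tier setup like this is usually implemented as a small routing rule in front of the API. The sketch below is one hypothetical way to express it: the task categories come from the description above, while the function, the category strings, and the `v4-pro` and `v4-flash` labels are invented for illustration.

```python
# Task types the article associates with V4 Pro. Labels are hypothetical.
HARD_TASKS = {"complex_reasoning", "multi_step_analysis", "agentic_coding"}


def choose_model(task_type: str, latency_sensitive: bool = False) -> str:
    """Send hard tasks to the larger model unless latency is the priority."""
    if task_type in HARD_TASKS and not latency_sensitive:
        return "v4-pro"
    return "v4-flash"


print(choose_model("agentic_coding"))
print(choose_model("chat"))
print(choose_model("complex_reasoning", latency_sensitive=True))
```

Defaulting to the cheaper model and escalating only on an explicit signal keeps the expensive tier a small share of total traffic.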

Post-training pipeline

The paper details a two-stage post-training pipeline that DeepSeek calls “domain-expert cultivation” followed by “unified consolidation.” In the first stage, domain-specific experts are independently refined through supervised fine-tuning and reinforcement learning using Group Relative Policy Optimization (GRPO). In the second stage, these specialized capabilities are consolidated into a single model through on-policy distillation.
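The "group relative" part of GRPO refers to scoring each sampled response against the other responses to the same prompt. The sketch below shows only that advantage-normalization step as GRPO is commonly described, not the full policy update, and whether V4's pipeline uses exactly this form is an assumption; the use of population standard deviation and the `eps` term are illustrative choices.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and spread of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses sampled for one prompt, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Because the baseline is the group's own mean, no separate value network is needed, and a group in which every response earns the same reward contributes no learning signal.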

This approach allows DeepSeek to train experts optimized for coding, mathematics, scientific reasoning, and agentic tool use separately, then merge their capabilities without catastrophic forgetting. The consolidated model retains high performance across all domains rather than excelling at one at the expense of others.

The pipeline supports three reasoning modes: non-thinking for fast responses, think-high for detailed logical analysis, and think-max for maximum effort on the hardest problems. This reflects a broader industry trend away from single-mode models toward configurable reasoning budgets that users can tune to their specific latency and quality requirements.

What this means for the open-source landscape

DeepSeek V4 represents the strongest open-weights challenge to closed-source frontier models since the original R1 release in January 2025. The MIT license removes licensing as a barrier to commercial adoption. Companies can download the weights, fine-tune them on proprietary data, deploy them on their own infrastructure, and redistribute modified versions without licensing fees or usage restrictions, provided they retain the license and copyright notice.

The NIST evaluation provides the most credible independent benchmark. The capability gap to U.S. frontier models is real but narrowing. On CAISI’s IRT-estimated Elo scale, DeepSeek V4 Pro scores 800 versus GPT-5.5 at 1,260. On cost per correctly solved task, DeepSeek leads on a majority of benchmarks. Organizations that need frontier-level performance on every task will still pay for GPT-5.5 or Claude Opus 4.7. Organizations that can tolerate the capability gap, which in CAISI’s tests ranges from about 7 percentage points on SWE-bench Verified to more than 30 on PortBench and ARC-AGI-2, in exchange for list-price savings of roughly 91% to 97% have a viable open-source alternative.

DeepSeek has signaled that V4 is a preview, not a final release. The paper closes with a note that ongoing work is focused on multimodal capabilities and further improvements to long-context reasoning. The legacy model aliases `deepseek-chat` and `deepseek-reasoner` are scheduled for retirement on July 24, 2026, at which point V4 Flash will become the default endpoint on the DeepSeek API.


Sources: DeepSeek V4 technical report (arXiv, June 19, 2026); CAISI Evaluation of DeepSeek V4 Pro (NIST, May 1, 2026); DeepSeek API pricing (DeepSeek, accessed June 20, 2026); Together AI model page (Together AI, 2026); freedeepseekapi.com benchmarks (2026)