DeepSeek open-sources DSpark, a speculative decoding framework speeding up V4 inference by up to 85 percent

DeepSeek has open-sourced DSpark, a speculative decoding framework that accelerates per-user text generation on its DeepSeek-V4 models by up to 85 percent (60 to 85 percent on V4-Flash, 57 to 78 percent on V4-Pro) without sacrificing output quality. The framework, along with the MIT-licensed DeepSpec training codebase, is available on GitHub and Hugging Face.

DSpark is a serving optimization, not a new model. The checkpoints DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark reuse existing V4 weights with an attached draft module, meaning deployment does not require retraining the base model.

How it works

Speculative decoding splits generation into two roles: a lightweight draft model proposes a block of candidate tokens, and the full target model verifies them in a single forward pass. Rejection sampling ensures the final output preserves the target distribution exactly, making the speedup lossless.
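The accept-or-resample step can be made concrete with a short sketch. The code below follows the standard speculative-sampling rule from the general literature, not DSpark's own code; the function names and the list-of-probabilities data layout are assumptions chosen for illustration.

```python
import random

def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def verify_block(draft_tokens, draft_probs, target_probs, rng):
    """Accept or reject a drafted block (illustrative, not DSpark's implementation).

    draft_tokens: k token ids proposed by the drafter
    draft_probs:  k distributions q_i the drafter sampled from
    target_probs: k + 1 distributions p_i from one target forward pass
    Returns the tokens emitted this round (between 1 and k + 1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q), renormalized,
        # and discard the rest of the block.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng))
        return out
    # Every draft token survived: take one bonus token from the target.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

Each round emits at least one token, so generation always makes progress, and a fully accepted block of k tokens yields k + 1 tokens for a single target pass.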

DSpark’s innovation lies in what it calls semi-autoregressive generation. Earlier drafters faced a trade-off: parallel drafters like DFlash are fast but suffer from decaying acceptance rates at later token positions, while autoregressive drafters like Eagle3 maintain quality but are slower. DSpark combines a heavy parallel backbone with a tiny sequential Markov head that adds a prefix-dependent bias before sampling each token. The sequential head adds only 0.2 to 1.3 percent per-round latency while improving accepted token length by up to 30 percent.
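A minimal sketch of the semi-autoregressive idea, under stated assumptions: the backbone's per-position logits are taken as already computed in one parallel pass, and the Markov head is modeled as a first-order lookup table keyed on the previous token. Both `backbone_logits` and `markov_bias` are hypothetical stand-ins; the actual head in DSpark may condition on the prefix differently.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def draft_block(backbone_logits, markov_bias, last_token, rng):
    """Sample a draft block position by position (illustrative only).

    backbone_logits: one logit vector per draft position, assumed to come
                     from a single parallel pass of the heavy backbone
    markov_bias:     markov_bias[prev][tok] is added to tok's logit when the
                     previous token is prev (assumed first-order head)
    last_token:      the last token already committed to the sequence
    Returns the drafted tokens and the distributions they were sampled from.
    """
    tokens, probs = [], []
    prev = last_token
    for logits in backbone_logits:
        # The only sequential work per position: a bias lookup and a softmax.
        bias = markov_bias[prev]
        q = softmax([l + b for l, b in zip(logits, bias)])
        tok = rng.choices(range(len(q)), weights=q)[0]
        tokens.append(tok)
        probs.append(q)
        prev = tok
    return tokens, probs
```

The expensive computation stays parallel, while the cheap sequential loop lets each sampled token influence the next one, which is the property a purely parallel drafter lacks.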

A confidence head estimates the survival probability for each drafted token, and a hardware-aware scheduler uses those estimates to adjust the verification length based on GPU load. When GPUs are idle, the system verifies more tokens; when they are busy, it verifies fewer. According to DeepSeek, stopping verification early in this way does not compromise losslessness.
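One way to picture the scheduler is as a cutoff on survival probability that rises with load. The sketch below is an illustrative policy, not DSpark's: the `gpu_load` signal in [0, 1], the rule that sets the cutoff equal to the load, and both function names are assumptions. The one general fact it relies on is that the expected number of accepted draft tokens equals the sum of the survival probabilities of the verified positions.

```python
def expected_accepted(survival, length):
    """Expected number of accepted draft tokens if `length` tokens are verified.

    survival[k] is the estimated probability that drafted token k and every
    token before it are accepted, so the expectation is a plain sum.
    """
    return sum(survival[:length])

def choose_verify_length(survival, gpu_load, min_len=1):
    """Pick how many drafted tokens to send to the target model.

    gpu_load in [0, 1] is a hypothetical utilization signal. The cutoff rises
    with load, so a busy GPU only verifies tokens that are likely to survive.
    """
    cutoff = gpu_load  # illustrative policy, an assumption of this sketch
    length = 0
    for s in survival:
        if s < cutoff:
            break
        length += 1
    # Verify at least min_len tokens, but never more than were drafted.
    return min(len(survival), max(min_len, length))
```

For example, with estimated survival probabilities of 0.9, 0.7, 0.5 and 0.2, a nearly idle GPU (load 0.1) would verify all four tokens, while a load of 0.6 would cut the block to the first two.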

Benchmark results

Offline benchmarks on Qwen3 and Gemma4 models showed DSpark delivering accepted sequences 26 to 31 percent longer than Eagle3 and 16 to 18 percent longer than DFlash. A 2-layer DSpark configuration outperformed a 5-layer DFlash configuration.

In production on DeepSeek-V4:

  • V4-Flash: 60 to 85 percent faster per-user generation over the MTP-1 baseline
  • V4-Pro: 57 to 78 percent faster

Throughput improvements ranged from 51 to 400 percent depending on concurrency levels, according to DeepSeek.
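To put these percentages in more familiar terms, the short sketch below converts a "percent faster" figure into a speedup multiplier and into the corresponding drop in time per token. It assumes that "X percent faster" means the generation rate rises by X percent, which is the usual reading but is not spelled out in the source; the function names are illustrative.

```python
def speedup_factor(percent_gain):
    """Multiplier on generation rate for a given percent gain (assumed reading)."""
    return 1.0 + percent_gain / 100.0

def latency_reduction_percent(percent_gain):
    """Percent drop in time per token implied by the same gain."""
    return 100.0 * (1.0 - 1.0 / speedup_factor(percent_gain))

for gain in (57, 60, 78, 85, 400):
    print(f"{gain}% gain: {speedup_factor(gain):.2f}x rate, "
          f"{latency_reduction_percent(gain):.0f}% less time per token")
```

Under that reading, an 85 percent gain corresponds to a 1.85x generation rate, or roughly 46 percent less time per token, and a 400 percent throughput gain corresponds to 5x.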

Open-source release

DeepSeek published the DSpark technical report alongside the DeepSpec codebase, which provides a standardized toolchain for training and evaluating speculative decoding drafters. The framework has been tested on open models including Gemma and Qwen, suggesting applicability beyond DeepSeek’s own ecosystem.

The release targets the cost of large-model inference, particularly in high-concurrency production environments where both per-user latency and total throughput matter.

Sources: DeepSeek Releases DSpark (MarkTechPost, June 27, 2026); DSpark technical report (DeepSeek); DeepSpec GitHub (MIT license); 36Kr analysis (June 27, 2026)
