Google DeepMind Releases DiffusionGemma, an Open Model That Generates Text 4x Faster

Google DeepMind has released DiffusionGemma, the first open-weight large-scale language model to use text diffusion instead of the standard autoregressive approach. The model generates text up to four times faster than a conventional transformer of the same size, though at a measurable cost in quality.

The release, announced on June 10 in partnership with NVIDIA, represents a meaningful shift in how local AI inference could work. Instead of producing one token at a time left to right, DiffusionGemma starts with a 256-token canvas of random placeholder tokens and denoises them in parallel, similar to how image generators like Stable Diffusion reconstruct images from noise.

### How It Works

DiffusionGemma is built on the same Gemma 4 26B Mixture-of-Experts backbone as Google’s standard autoregressive model, but it replaces sequential token generation with a process Google calls “Uniform State Diffusion.” The model writes roughly 15 to 20 tokens per forward pass across the full canvas, iterating up to 48 denoising steps with adaptive early stopping. For sequences longer than 256 tokens, finished segments commit to a KV cache and a new canvas initializes conditioned on prior context.
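The loop described above can be illustrated with a toy simulation. This is a minimal sketch, not Google's algorithm: the function names, the `tokens_per_step` parameter, and the idea of copying from a known `target` are all hypothetical stand-ins for a real denoiser, which predicts tokens rather than looking them up. Only the 256-token canvas and the 48-step cap come from the article.

```python
import random

CANVAS_SIZE = 256   # canvas length reported in the article
MAX_STEPS = 48      # denoising step cap reported in the article


def denoise_canvas(target, tokens_per_step, rng, vocab_size=1000):
    """Toy denoiser: start from random tokens and settle a few positions per step.

    `target` stands in for the model's predictions; a real denoiser
    does not know the answer in advance.
    """
    canvas = [rng.randrange(vocab_size) for _ in target]
    unresolved = list(range(len(target)))
    rng.shuffle(unresolved)  # positions settle in arbitrary order, not left to right
    steps = 0
    # Stop early once every position is settled, or give up at the step cap.
    while unresolved and steps < MAX_STEPS:
        # One simulated forward pass updates several positions at once.
        for _ in range(min(tokens_per_step, len(unresolved))):
            pos = unresolved.pop()
            canvas[pos] = target[pos]
        steps += 1
    return canvas, steps


def generate(target, tokens_per_step=16, seed=0):
    """Process a long sequence one canvas at a time, committing each finished segment."""
    rng = random.Random(seed)
    committed = []      # stands in for segments committed to the KV cache
    total_steps = 0
    for start in range(0, len(target), CANVAS_SIZE):
        segment = target[start:start + CANVAS_SIZE]
        canvas, steps = denoise_canvas(segment, tokens_per_step, rng)
        committed.extend(canvas)
        total_steps += steps
    return committed, total_steps


if __name__ == "__main__":
    target = [i % 1000 for i in range(600)]
    output, steps = generate(target, tokens_per_step=16)
    print(f"{len(output)} tokens in {steps} simulated forward passes")
```

At 16 tokens per pass, a full canvas settles in 16 passes, well under the 48-step cap, which is consistent with the article's description of early stopping. In this toy the final canvas is simply shorter; how the real model handles a partial canvas is not stated.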

The key insight is architectural. Autoregressive models are bottlenecked by memory bandwidth: generating each token requires streaming the model weights from HBM into the compute units, so in single-user local inference the compute units sit idle most of the time while they wait on memory. Diffusion reverses this. The parallel canvas approach keeps tensor cores saturated, shifting the bottleneck from memory bandwidth to raw compute. This is the same principle that made Stable Diffusion practical on consumer GPUs.
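A back-of-the-envelope calculation shows why producing more tokens per pass raises the memory-bandwidth ceiling. The sketch below rests on assumptions that are not in the article: a peak HBM bandwidth of about 3.35 TB/s (roughly an H100 SXM), about 4 billion active parameters per token (one reading of "A4B" in the model name), and one byte per parameter for FP8. The function name is hypothetical.

```python
def decode_ceiling(bandwidth_bytes_per_s, active_params, bytes_per_param, tokens_per_pass=1):
    """Upper bound on tokens/s if every forward pass streams the active weights once."""
    bytes_per_pass = active_params * bytes_per_param
    passes_per_s = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_s * tokens_per_pass


# Assumed figures, not from the article.
HBM_BANDWIDTH = 3.35e12   # bytes per second
ACTIVE_PARAMS = 4e9       # active parameters per token (assumed from "A4B")
BYTES_PER_PARAM = 1       # FP8

autoregressive = decode_ceiling(HBM_BANDWIDTH, ACTIVE_PARAMS, BYTES_PER_PARAM, tokens_per_pass=1)
diffusion = decode_ceiling(HBM_BANDWIDTH, ACTIVE_PARAMS, BYTES_PER_PARAM, tokens_per_pass=16)

if __name__ == "__main__":
    print(f"autoregressive bandwidth ceiling: {autoregressive:,.0f} tokens/s")
    print(f"diffusion bandwidth ceiling:      {diffusion:,.0f} tokens/s")
```

These are ceilings, not predictions. In a Mixture-of-Experts model, a pass over many tokens is likely to touch more experts than a single token does, so the real bytes read per pass would be higher and the gain smaller. The point is only that once the bandwidth ceiling moves this far up, compute becomes the limiting factor.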

### The Speed Numbers

On an NVIDIA H100 running FP8 precision, DiffusionGemma achieves over 1,000 tokens per second, roughly four times the throughput of the equivalent autoregressive Gemma 4 26B model. On a consumer GeForce RTX 5090, using quantization to fit the model in about 18 gigabytes of VRAM, it reaches over 700 tokens per second. NVIDIA’s DGX Spark, a $3,000-class personal AI workstation using the Grace Blackwell GB10 chip, delivers around 150 tokens per second. The DGX Station peaks at up to 2,000 tokens per second.
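To put those figures in more familiar terms, the short sketch below converts each reported throughput into the time needed to fill one 256-token canvas, and derives the autoregressive baseline implied by the "four times" claim on the H100. The reported numbers are floors or approximations ("over", "around", "up to"), so the results are rough; the variable and function names are illustrative.

```python
CANVAS_TOKENS = 256

# Throughput figures as reported in the article, in tokens per second.
throughput_tok_s = {
    "H100 (FP8)": 1000,
    "RTX 5090 (quantized)": 700,
    "DGX Spark": 150,
    "DGX Station (peak)": 2000,
}


def seconds_for(tokens, tok_per_s):
    """Time to generate `tokens` at a steady `tok_per_s`."""
    return tokens / tok_per_s


# "Roughly four times" the autoregressive model implies about this baseline on the H100.
implied_autoregressive_h100 = throughput_tok_s["H100 (FP8)"] / 4

if __name__ == "__main__":
    for device, tps in throughput_tok_s.items():
        print(f"{device}: {seconds_for(CANVAS_TOKENS, tps):.2f} s per 256-token canvas")
    print(f"Implied autoregressive H100 baseline: {implied_autoregressive_h100:.0f} tokens/s")
```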

Google is explicit that this speedup applies to local, single-user inference. In high-throughput cloud serving, autoregressive models can saturate compute efficiently through request batching, and parallel decoding offers diminishing returns.

### The Quality Trade-Off

DiffusionGemma trails its autoregressive counterpart on every published benchmark. On MMLU Pro it scores 77.6 percent versus Gemma 4 26B’s 82.6 percent. On GPQA Diamond the gap is wider: 73.2 percent against 82.3 percent. On MMMU Pro, which tests multimodal reasoning, DiffusionGemma scores 54.3 percent compared to 73.8 percent.
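The size of each gap is easier to compare when expressed both in percentage points and as a share of the autoregressive score. The sketch below computes both from the scores quoted above; the `gap` helper is an illustrative name, not part of any benchmark tooling.

```python
# (DiffusionGemma score, Gemma 4 26B score), in percent, as quoted in the article.
scores = {
    "MMLU Pro": (77.6, 82.6),
    "GPQA Diamond": (73.2, 82.3),
    "MMMU Pro": (54.3, 73.8),
}


def gap(diffusion, autoregressive):
    """Return (gap in percentage points, gap as a percent of the autoregressive score)."""
    points = round(autoregressive - diffusion, 1)
    relative = round(100 * points / autoregressive, 1)
    return points, relative


if __name__ == "__main__":
    for name, (diff_score, ar_score) in scores.items():
        points, relative = gap(diff_score, ar_score)
        print(f"{name}: {points} points lower ({relative}% relative)")
```

The gaps are 5.0, 9.1, and 19.5 percentage points respectively, so the multimodal benchmark shows by far the largest shortfall.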

Google’s researchers are upfront about this. “DiffusionGemma’s overall output quality is lower than standard Gemma 4,” the team wrote in the announcement. “For applications that demand maximum quality, we recommend deploying standard Gemma 4.” The model is positioned as an experimental complement, not a replacement. Choose speed for latency-sensitive tasks and fall back to the autoregressive version when accuracy matters most.

### Open Weights, Apache 2.0 License

DiffusionGemma is released under the Apache 2.0 license, which is more permissive than the standard Gemma license terms. It permits commercial use, modification, and redistribution, subject to the license’s attribution and notice requirements. The model is available on Hugging Face as `google/diffusiongemma-26B-A4B-it`, with a quantized NVFP4 variant from NVIDIA and a GGUF version from Unsloth.

It accepts text, image, and video input (up to 60 seconds of video) and produces text output. It covers 140 languages, matching Gemma 4’s language coverage. The context window extends to 256,000 tokens. Day-zero support launched in Hugging Face Transformers, vLLM, Unsloth, and MLX for Apple Silicon.

### What This Means for Local AI

The practical implication is that a 26-billion-parameter model can now run on a consumer GPU at speeds previously associated with cloud-tier accelerators. For multi-step agentic workflows, where each inference step’s latency compounds, a speedup of up to 4x per step shortens end-to-end response time roughly in proportion. Fine-tuned versions of DiffusionGemma have shown the ability to solve structured tasks like Sudoku in 12 denoising steps instead of the baseline 48, which cuts the number of forward passes further.
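How per-step latency compounds can be shown with a simple model of a sequential agent loop. The workload below (10 steps of 500 generated tokens each) is hypothetical, and the 250 tokens-per-second baseline is the figure implied by the article's "four times" claim on the H100, not a published number. The sketch also ignores prompt processing and tool-call time.

```python
def workflow_seconds(steps, tokens_per_step, tok_per_s):
    """Total generation time for sequential steps that cannot overlap."""
    return steps * tokens_per_step / tok_per_s


# Hypothetical workload; throughputs taken or derived from the article's H100 figures.
STEPS = 10
TOKENS_PER_STEP = 500

autoregressive_total = workflow_seconds(STEPS, TOKENS_PER_STEP, 250)
diffusion_total = workflow_seconds(STEPS, TOKENS_PER_STEP, 1000)

# Upper bound on the extra gain from 12 denoising steps instead of 48,
# assuming every step costs the same.
denoising_step_ratio = 48 / 12

if __name__ == "__main__":
    print(f"autoregressive: {autoregressive_total:.0f} s, diffusion: {diffusion_total:.0f} s")
    print(f"fewer denoising steps: up to {denoising_step_ratio:.0f}x fewer forward passes")
```

Under these assumptions the same workflow drops from 20 seconds to 5 seconds of generation time.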

The model’s 18-gigabyte quantized footprint and Apache 2.0 license also make it attractive for air-gapped deployments in finance, healthcare, legal, and defense, sectors that need capable local inference without sending data to third-party APIs.

Google has been careful to set expectations. DiffusionGemma is an experimental model and the company continues to recommend the autoregressive Gemma 4 for quality-critical production use. But for the first time, developers have a practical choice between speed and accuracy using the same underlying architecture, in open weights, on their own hardware.


Sources: Google Blog (June 10, 2026); NVIDIA Blog (June 10, 2026); Ars Technica (June 10, 2026); Hugging Face (June 10, 2026)
