
Published: June 02, 2026, 23:02 UTC
A small inference company has done what the major cloud providers couldn’t easily offer: it got DeepSeek’s state-of-the-art V4 Flash model running on AMD’s MI300X accelerators. The breakthrough, documented in a detailed worklog by Doubleword engineer Fergus Finn, suggests that AMD’s long-neglected older-generation AI silicon may be a viable, cheaper alternative to Nvidia’s supply-constrained H100s for open-weight model inference.
The hardware gap that sparked the effort. AMD’s MI300X, launched in December 2023, packs 192 GB of HBM3 memory per card — more than double the H100’s 80 GB — with 5.3 TB/s memory bandwidth and 2.61 PFLOPS of FP8 compute. List price is roughly half the H100’s. Yet as of early May 2026, running vLLM with DeepSeek V4 Flash on MI300X simply didn’t work ([Fergus Finn](https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/)).
The problem wasn’t compute power; it was software. While AMD’s newer MI350X and MI355X chips have benefited from a focused software push, support for the older MI300X languished. Meanwhile, Nvidia H100 prices on one-year rentals climbed 40% in five months, and on-demand capacity sold out across every major cloud provider.
Why MI300X was incompatible. The root cause traces back to a format war over FP8 arithmetic. AMD and Graphcore proposed one FP8 standard in a 2022 preprint, backed by Qualcomm. Arm, Intel, and Nvidia proposed another through the Open Compute Project. AMD’s side lost: newer AMD chips (MI325, MI350, MI355X) all moved to the OCP-standard FP8. But the MI300X remained locked to the older “fnuz” dialect, short for “finite, NaN, unsigned zero”: it has no infinities and no negative zero, and the bit pattern that would otherwise encode negative zero represents NaN instead.
The two FP8 dialects share their bit layout but differ in exponent bias by one. Reading the same byte in the wrong dialect produces results off by exactly a factor of two — silent, deterministic, and catastrophic ([Fergus Finn](https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/)).
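The factor of two can be checked by hand. The sketch below is illustrative rather than taken from the worklog: it assumes the E4M3 variant of FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits), with an exponent bias of 7 in the OCP dialect and 8 in fnuz. The function name `decode_e4m3` is made up for this example.

```python
import math


def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one FP8 E4M3 byte under the OCP or the fnuz dialect."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if fnuz:
        if byte == 0x80:          # fnuz: the "negative zero" pattern means NaN
            return math.nan
        bias = 8
    else:
        if exp == 0x0F and man == 0x07:   # OCP: S.1111.111 means NaN
            return math.nan
        bias = 7
    if exp == 0:                  # zero and subnormals
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


byte = 0x40                       # sign 0, exponent 1000, mantissa 000
as_ocp = decode_e4m3(byte, fnuz=False)
as_fnuz = decode_e4m3(byte, fnuz=True)
print(as_ocp, as_fnuz)            # the OCP reading is twice the fnuz reading
```

Every byte that is finite in both dialects follows the same pattern. The only exceptions are the NaN encodings, which sit at different bit patterns in each dialect.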
The fixes. Finn and the Doubleword team patched vLLM’s DeepSeek V4 compressor and its fused compress/quant/cache-write paths to use the platform FP8 dtype, so that scales and cache bytes agree. They also routed the sliding-window K-cache through a fnuz-aware fused quantise-and-insert helper (commits 236de4e64 and bd06e5d87 in a public vLLM fork). Additional work addressed missing attention fast paths and HIP graph support for AMD’s ROCm stack.
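To see why scales and cache bytes have to agree on a dialect, consider a toy scaled FP8 cache. This is a minimal pure-Python sketch, not the vLLM code: it assumes E4M3 with a per-tensor scale, and the helpers `decode`, `encode`, `quantise` and `dequantise` are hypothetical names for illustration. The writer here assumes the OCP dialect; one reader matches it and one reads the same bytes as fnuz.

```python
import math

# Exponent bias and largest finite magnitude for each E4M3 dialect.
DIALECTS = {"ocp": {"bias": 7, "max": 448.0}, "fnuz": {"bias": 8, "max": 240.0}}


def decode(byte: int, dialect: str) -> float:
    exp, man = (byte >> 3) & 0x0F, byte & 0x07
    if dialect == "fnuz" and byte == 0x80:
        return math.nan                       # fnuz NaN
    if dialect == "ocp" and exp == 0x0F and man == 0x07:
        return math.nan                       # OCP NaN
    bias = DIALECTS[dialect]["bias"]
    sign = -1.0 if byte & 0x80 else 1.0
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


def encode(x: float, dialect: str) -> int:
    """Brute-force round-to-nearest over the finite codes (demo only)."""
    finite = [b for b in range(256) if not math.isnan(decode(b, dialect))]
    return min(finite, key=lambda b: abs(decode(b, dialect) - x))


def quantise(values, dialect):
    # The scale maps the largest magnitude onto the dialect's largest finite value.
    scale = max(abs(v) for v in values) / DIALECTS[dialect]["max"]
    return [encode(v / scale, dialect) for v in values], scale


def dequantise(codes, scale, dialect):
    return [decode(c, dialect) * scale for c in codes]


keys = [0.5, -1.5, 3.0]
codes, scale = quantise(keys, "ocp")          # writer assumes OCP
agree = dequantise(codes, scale, "ocp")       # reader uses the same dialect
mismatch = dequantise(codes, scale, "fnuz")   # reader uses the other dialect
print(agree)
print(mismatch)                               # every entry is half of `agree`
```

Picking the dialect once, from the platform, and using it for both the scale computation and the byte encoding removes the mismatch, which is the spirit of the fix described above.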
The payoff. After initial tuning, an 8-GPU MI300X node running DeepSeek V4 Flash achieved approximately 2,485 output tokens per second per GPU, rising to around 2,699 tok/s per GPU after further optimization, or roughly 21,592 tok/s aggregate across the node.
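The node-level figure and the size of the improvement follow directly from the per-GPU numbers. A quick check, using only the figures quoted above:

```python
GPUS_PER_NODE = 8
BEFORE_TOK_S = 2485   # per-GPU output tok/s before optimization
AFTER_TOK_S = 2699    # per-GPU output tok/s after optimization

aggregate = GPUS_PER_NODE * AFTER_TOK_S
gain_pct = (AFTER_TOK_S / BEFORE_TOK_S - 1) * 100

print(f"aggregate: {aggregate} tok/s")
print(f"per-GPU gain: {gain_pct:.1f}%")
```

That works out to 21,592 tok/s for the node and a per-GPU gain of about 8.6%.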
DeepSeek V4 Flash itself is a 284-billion-parameter mixture-of-experts model with only 13 billion active parameters per token, supporting a 1-million-token context window. Its aggressive FP4 (expert weights) and FP8 (attention/norm/router) mixed precision makes it a near-ideal workload for MI300X’s high-bandwidth, large-memory architecture ([AI News](https://www.artificialintelligence-news.com/news/deepseek-v4-flash-released-mit-license-open-source/)).
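A rough sense of the memory picture can be had from the parameter count alone. The sketch below is a back-of-the-envelope bound, not a measurement: the split between FP4 and FP8 weights is not given here, so it assumes all-FP4 for the floor and all-FP8 for the ceiling, and it ignores quantization scales, activations, and the KV cache. The 8-GPU H100 node is a hypothetical comparison point.

```python
TOTAL_PARAMS = 284e9      # total parameters
ACTIVE_PARAMS = 13e9      # parameters active per token
GPUS_PER_NODE = 8
HBM_GB = {"MI300X": 192, "H100": 80}   # per-card memory in GB

# Assumed bounds: 0.5 byte per parameter (FP4) up to 1 byte per parameter (FP8).
low_gb = TOTAL_PARAMS * 0.5 / 1e9
high_gb = TOTAL_PARAMS * 1.0 / 1e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Memory left on a node after the pessimistic (all-FP8) weight estimate.
headroom_gb = {name: GPUS_PER_NODE * gb - high_gb for name, gb in HBM_GB.items()}

print(f"weights: {low_gb:.0f} to {high_gb:.0f} GB")
print(f"active per token: {active_fraction:.1%}")
for name, gb in headroom_gb.items():
    print(f"{name} node headroom: {gb:.0f} GB")
```

Under these assumptions the weights occupy between 142 GB and 284 GB, so most of an MI300X node's memory remains free for long-context KV cache, while only about 4.6% of the parameters are read for any single token.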
Why this matters for the broader market. The successful bring-up carries three implications.
First, cost arbitrage. MI300X rents for roughly half the H100’s hourly rate while offering more than double the memory — crucial for memory-bound MoE inference at long contexts. With H100/H200 on-demand sold out cloud-wide, MI300X is available today ([Fergus Finn](https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/)).
Second, the software gap is narrowing. The conventional wisdom has been that AMD’s ROCm stack trails Nvidia’s CUDA by 20 to 30 percent in kernel performance. This work shows that the gap can be closed through targeted engineering, and that the fixes can benefit the wider AMD inference ecosystem, since the vLLM patches are available in a public fork.
Third, model architecture matters. DeepSeek V4 Flash’s design, with 13 billion active parameters and FP4/FP8 mixed precision, suits MI300X well. The model was designed to be efficient on hardware that has a large memory pool and is limited by memory bandwidth rather than raw compute, which describes MI300X exactly.
The open question is whether AMD’s newer silicon (the MI350X and MI355X, now tracked by SemiAnalysis’s InferenceX dashboard) can deliver on the software compatibility that MI300X needed more than two years to achieve. For now, Doubleword’s worklog is proof that the gap is surmountable, not structural.
Sources: [Fergus Finn — Bringing up DeepSeek-V4-Flash on AMD MI300X](https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/) (June 1, 2026); [AI News — DeepSeek V4 Flash released under MIT license](https://www.artificialintelligence-news.com/news/deepseek-v4-flash-released-mit-license-open-source/) (April 24, 2026); [SemiAnalysis — The Great GPU Shortage: Rental Capacity](https://semianalysis.com/) (April 2026)

