Miami Startup Subquadratic Claims Breakthrough on LLMs’ Quadratic Bottleneck

A Miami-based startup called Subquadratic claims to have solved the most fundamental efficiency problem holding back large language models: the quadratic scaling of attention. Its model, SubQ, uses dynamic sparse attention to process up to 12 million tokens at roughly 2% of the cost of a comparable dense-attention model, a claim that has drawn both excitement and sharp skepticism from the AI community.

Subquadratic emerged from stealth in May with a bold thesis: the transformer architecture’s dense attention mechanism, which compares every token to every other token, wastes enormous compute on irrelevant relationships. SubQ’s architecture, which the company calls Sub-Quadratic Sparse Attention (SSA), selects only the token pairs that matter, computed on the fly for each input. CEO Justin Dangel describes it as “finding and focusing on the relationships that actually carry information.”
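To make the idea concrete, here is a minimal Python sketch of one generic form of sparse attention, top-k selection, set beside ordinary dense attention. It is not SubQ's method, whose selection mechanism has not been published; the function names, the parameter `k` and the toy vectors are all illustrative assumptions. This toy version still scores every pair before discarding most of them, so it shows what attending only to "the pairs that matter" means, not how to find those pairs cheaply, which is the hard part.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_attention(queries, keys, values):
    """Dense attention: every query is compared with every key."""
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [dot(q, key) / math.sqrt(len(q)) for key in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def topk_sparse_attention(queries, keys, values, k):
    """Illustrative top-k attention: each query keeps only its k best keys.

    Assumption: top-k is a stand-in for whatever selection rule a real
    sparse model uses. Scores are still computed for all pairs here.
    """
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [dot(q, key) / math.sqrt(len(q)) for key in keys]
        # Indices of the k highest-scoring keys for this query.
        keep = sorted(range(len(keys)), key=lambda i: scores[i],
                      reverse=True)[:k]
        weights = softmax([scores[i] for i in keep])
        out.append([sum(w * values[i][d] for w, i in zip(weights, keep))
                    for d in range(dim)])
    return out


# Toy inputs (hypothetical): 2 queries, 3 keys, 3 values, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

dense_out = dense_attention(queries, keys, values)
sparse_out = topk_sparse_attention(queries, keys, values, k=1)
```

With `k` equal to the number of keys, the sparse function reproduces dense attention; with a small `k`, each output depends on only a handful of tokens.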

Subquadratic commissioned Appen, an independent third-party evaluator, to validate its claims. Appen’s director of GenAI research, Jeanine Sinanan-Singh, told MIT Technology Review the results were “really exciting” and that the architecture “could be a game changer.”

The Appen evaluation found SubQ to be 56 times faster than FlashAttention-2 on a one-million-token context. On the RULER 128K multi-task retrieval test, SubQ scored 99.12%. For long-context needle-in-a-haystack retrieval, it achieved 98% accuracy at 6 million tokens and 98% at 12 million tokens. On competitive coding benchmarks, SubQ posted 89.7% on LiveCodeBench, near frontier-level performance. A cost comparison on the RULER 128K test showed SubQ costing roughly US$8 versus US$2,600 for Anthropic’s Claude Opus 4.6 on the same task.

However, SubQ’s scores on knowledge-intensive benchmarks are less dominant. On GPQA Diamond, a graduate-level science reasoning test, SubQ scored 85.4%, competitive but trailing GPT-5.5 at 93.2% and Claude Opus 4.8 at 92%.

Noise to signal

The announcement has been met with significant skepticism. Former OpenAI researcher Will Depue told MIT Technology Review that “the public evidence does not yet justify the stronger claim that they have solved the quadratic attention bottleneck.” On Hacker News, commenters noted that Subquadratic reused weights from the open-source Chinese model Qwen, undermining the claim of reinventing the architecture from scratch. Others pointed to charts on the company’s website that visually minimised performance gaps against competitors, which one commenter called a “chart crime”; Subquadratic’s CTO, Alex Whedon, said the effect was unintentional.

The company acknowledges the skepticism. Speaking on Hacker News, Whedon said the decision to delay publishing a full technical paper was driven by a desire to “see what else folks wanted and share more benchmarks.” The company has not yet widely released model weights or opened public API access beyond a waitlist of roughly 500 enterprise customers.

What’s at stake

The quadratic bottleneck, in which attention computation grows with the square of input length, has been the primary obstacle to practical long-context language models. If Subquadratic’s dynamic sparse attention is real and generalisable, it would be the most significant architectural advance since the transformer itself. Every frontier lab, from OpenAI to Google DeepMind, has pursued sparse-attention variants; none has demonstrated a clean, production-viable solution at this scale.
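The scaling argument can be checked with simple counting. The sketch below compares the number of token-pair comparisons in dense attention with a hypothetical sparse scheme in which each token attends to a fixed budget of other tokens. The budget of 4,096 is an assumption chosen purely for illustration; Subquadratic has not disclosed how many pairs SubQ keeps, and the count ignores the cost of choosing which pairs to keep.

```python
def dense_pairs(n):
    """Dense attention compares every token with every token: n * n."""
    return n * n


def sparse_pairs(n, budget):
    """Hypothetical sparse scheme: each token attends to at most `budget` tokens."""
    return n * min(budget, n)


BUDGET = 4096  # illustrative assumption, not a published SubQ figure

for n in (128_000, 1_000_000, 12_000_000):
    dense = dense_pairs(n)
    sparse = sparse_pairs(n, BUDGET)
    print(f"{n:>12,} tokens: dense {dense:.2e} pairs, "
          f"sparse {sparse:.2e} pairs, ratio {dense / sparse:,.0f}x")
```

Doubling the input quadruples the dense count but only doubles the sparse one, so the gap between the two widens in proportion to context length.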

Dangel is unambiguous about his bet: “We don’t think anybody will be building on transformers in a few years.” Whether that prediction is prescience or overreach depends on whether SubQ can survive the scrutiny that comes next.
