New arXiv paper proves a fundamental ceiling on combining language models

A new arXiv preprint derives a mathematical ceiling on how much accuracy can be gained by combining multiple large language models, and its findings challenge one of the field’s most common assumptions.

The paper, by researcher Josef Chen, analyses 67 frontier models from 21 providers and introduces the concept of the “co-failure ceiling” (β): the fraction of queries where every single model in the pool is wrong. For any multi-model system that selects one model’s answer per query, accuracy cannot exceed 1 minus β, no matter how sophisticated the routing, voting, or cascading strategy. The reason is simple: on those queries there is no correct answer in the pool to select.
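To make the definition concrete, here is a minimal Python sketch of how β and the resulting ceiling could be computed from a table of per-query results. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
def co_failure_rate(correct):
    """Fraction of queries on which every model is wrong.

    correct[q][m] is True if model m answered query q correctly.
    """
    if not correct:
        raise ValueError("need at least one query")
    all_wrong = sum(1 for row in correct if not any(row))
    return all_wrong / len(correct)


def selection_ceiling(correct):
    """Upper bound (1 - beta) on accuracy for any system that picks one model's answer."""
    return 1.0 - co_failure_rate(correct)


def best_single_accuracy(correct):
    """Accuracy of the best individual model in the pool."""
    n_queries = len(correct)
    n_models = len(correct[0])
    return max(sum(row[m] for row in correct) / n_queries for m in range(n_models))


# Toy data (hypothetical): 5 queries, 3 models.
toy = [
    [True, False, False],
    [False, True, False],
    [False, False, False],  # every model fails here: a co-failure
    [True, True, True],
    [False, False, True],
]

beta = co_failure_rate(toy)
print("beta:", beta)
print("ceiling:", selection_ceiling(toy))
print("best single model:", best_single_accuracy(toy))
```

In this toy table the best single model scores 40 percent while the ceiling is 80 percent. The gap between the two is the most that any routing, voting, or cascading scheme over this pool could recover.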

The ceiling is real and measurable

Across open-ended mathematics benchmarks, the observed co-failure rate was 5.2 percent: on 5.2 percent of all queries, every one of the 67 models was wrong. This sets a hard upper bound of 94.8 percent on the accuracy that any answer-selecting ensemble can reach with this pool.

The paper finds that standard statistical models significantly underestimate this co-failure risk. A Gaussian copula model predicted a β of only 2.3 percent. The paper puts the observed rate at about 2.5 times that prediction, with a 90 percent confidence interval of 1.7 to 3.4 times.
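For readers unfamiliar with the technique, the sketch below shows one common way a Gaussian copula turns per-model failure rates into a predicted β. It assumes a one-factor model in which every pair of models shares the same latent correlation rho. That parametrisation, the function name, and the example inputs are assumptions made for illustration; the article does not say which copula specification the paper fitted.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, from the Python standard library


def copula_co_failure(fail_probs, rho, grid=4001, z_max=8.0):
    """P(all models fail) under a one-factor Gaussian copula.

    Assumed model: latent X_i = sqrt(rho) * Z + sqrt(1 - rho) * eps_i, and
    model i fails when X_i falls below the threshold Phi^{-1}(fail_probs[i]).
    Conditional on the shared factor Z the failures are independent, so the
    joint probability is a one-dimensional integral over Z.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    thresholds = [_STD.inv_cdf(p) for p in fail_probs]
    a, s = math.sqrt(rho), math.sqrt(1.0 - rho)
    step = 2.0 * z_max / (grid - 1)
    total = 0.0
    for i in range(grid):
        z = -z_max + i * step
        joint = 1.0
        for t in thresholds:
            joint *= _STD.cdf((t - a * z) / s)
        weight = 0.5 if i in (0, grid - 1) else 1.0  # trapezoid rule end points
        total += weight * joint * _STD.pdf(z)
    return total * step


# Hypothetical inputs: five models that each fail 30 percent of the time.
for rho in (0.0, 0.4, 0.8):
    print(rho, copula_co_failure([0.3] * 5, rho))
```

With rho set to zero the prediction reduces to the product of the individual failure rates, and it rises as rho grows. A model of this kind would understate β whenever real failures cluster on the same hard queries more tightly than a Gaussian dependence structure allows.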

Diversity of failure, not number of models

The key insight: gains from combining models come from models failing on different questions, not from adding more models. At matched quality, low-correlation heterogeneous ensembles outperform high-correlation self-mixture-of-agents approaches. But even with diverse model pools, combining models rarely beats the single best model unless a strong query-level routing signal is available.
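The point about diversity can be illustrated with a toy comparison. The sketch below builds two hypothetical two-model pools with identical per-model accuracy, one where the models fail on the same queries and one where their failures barely overlap, and reports the error correlation and the ceiling for each. The data and function names are invented for illustration.

```python
import math


def error_correlation(errs_a, errs_b):
    """Pearson correlation between two models' 0/1 error indicators."""
    n = len(errs_a)
    mean_a = sum(errs_a) / n
    mean_b = sum(errs_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(errs_a, errs_b)) / n
    var_a = sum((a - mean_a) ** 2 for a in errs_a) / n
    var_b = sum((b - mean_b) ** 2 for b in errs_b) / n
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a model with constant errors")
    return cov / math.sqrt(var_a * var_b)


def ceiling(pool):
    """1 - beta, where pool[m][q] is 1 if model m is wrong on query q."""
    n_queries = len(pool[0])
    all_wrong = sum(1 for q in range(n_queries) if all(errs[q] for errs in pool))
    return 1.0 - all_wrong / n_queries


# Hypothetical pools: 10 queries, every model is 70 percent accurate.
model_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
same_failures = [model_a, [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]
different_failures = [model_a, [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]]

for name, pool in (("same failures", same_failures), ("different failures", different_failures)):
    print(name, error_correlation(pool[0], pool[1]), ceiling(pool))
```

Here the pool with identical failures has a ceiling of 70 percent, no better than either model alone, while the pool with mostly disjoint failures has a ceiling of 90 percent. Reaching that ceiling still requires knowing, query by query, which model to trust, which is why a strong routing signal matters.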

On execution-graded code tasks (k=17 models), the co-failure rate was 7.9 percent. On GPQA-Diamond benchmark questions converted from multiple-choice to free-response format, the rate jumped to 12.7 percent, showing that answer format can shift where the ceiling sits.

Practical implications

The paper recommends computing a Clopper-Pearson confidence bound on β as a standard diagnostic before any router or aggregator is trained. The bound acts as a certificate on the maximum possible gain from any router, vote, or cascade. The finding suggests that many AI teams investing in multi-model orchestration may be chasing gains that are structurally impossible to achieve.
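A Clopper-Pearson interval is the standard exact confidence interval for a binomial proportion, so it can be computed from nothing more than the number of queries and the number of co-failures. The sketch below is one way to do that using only the Python standard library, solving for the bounds by bisection on the binomial CDF. The function names, the query count of 1,000, and the count of 52 co-failures are hypothetical, and the paper's exact procedure may differ.

```python
import math


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with terms computed in log space."""
    if x < 0:
        return 0.0
    if x >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(x + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve_p(x, n, target):
    """Find p with binom_cdf(x, n, p) == target (the CDF falls as p rises)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom_cdf(x, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(failures, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) interval for a binomial proportion."""
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    lower = 0.0 if failures == 0 else _solve_p(failures - 1, n, 1.0 - alpha / 2.0)
    upper = 1.0 if failures == n else _solve_p(failures, n, alpha / 2.0)
    return lower, upper


# Hypothetical example: 52 all-model failures observed on 1,000 queries.
beta_lo, beta_hi = clopper_pearson(52, 1000)
print("beta interval:", beta_lo, beta_hi)
print("accuracy ceiling is at most:", 1.0 - beta_lo)
```

Because accuracy cannot exceed 1 minus β, the lower end of the interval for β is the side that caps achievable accuracy. Subtracting the best single model's accuracy from that cap gives a bound on how much any combination strategy could add.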

The preprint has not yet been peer-reviewed.

Source: When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models (arXiv, June 2026)