
A new paper from one of Europe’s leading AI law scholars argues that the entire field of legal AI evaluation has been measuring the wrong thing. The result is a regulatory blind spot that matters more with each passing day: the EU AI Act requires that high-risk judicial AI systems meet an accuracy standard, but no existing test can say whether a model actually reasons like a lawyer.
The paper, authored by Michèle Finck, professor of law and AI at the University of Tübingen and director of its Institute for AI and Law, is titled “The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act.” It was published on arXiv on June 16, 2026.
Current legal AI benchmarks measure things like binary outcome prediction (will the court rule for plaintiff or defendant?), document classification (is this a contract clause or a tort claim?), and multiple-choice bar exam questions. These are what Finck calls “paralegal tasks”: useful support work, but not the interpretive core of legal practice.
Doctrinal legal reasoning, the paper argues, has four structural features that no existing benchmark captures. First, internalism: the model must reason from within the legal order, using legal sources as its exclusive framework. Second, normativity: it must evaluate what the law requires, not what is statistically most frequent. Third, contestability: legal questions have defensible competing positions, not single right answers. Fourth, coherence: norms must hang together as a system, and the best reading is the one that fits the whole.
The paper surveys eight existing legal benchmarks, including LexGLUE (US and EU law), ECtHR-CASES (European Court of Human Rights), LegalBench (US law), GreekBarBench, LEXam (Swiss law), BenGER (German law), and Harvey LAB (US and UK). None of them test doctrinal reasoning. They test classification, retrieval, and task completion, which are important but fundamentally different from the kind of reasoning a judge or advocate performs.
The 21-failure taxonomy
Finck’s main constructive contribution is a taxonomy of 21 specific failure modes that a doctrinal legal reasoning benchmark must detect. The failures are organized into five categories.
Source recognition and authority failures: a model might not understand the hierarchy of norms (EU regulation trumps national law), or fail to recognize that EU law uses “autonomous concepts” that have their own meaning independent of any member state’s legal system. It might give too much weight to soft law or too little, or fail to resolve conflicts between differently authoritative sources. It also might not handle multilingualism properly, treating translations as equivalent when they carry different legal weight.
Operative doctrine failures: the model might not properly apply doctrines such as direct effect (EU law creating individual rights that national courts must enforce) or supremacy. It might ignore procedural context or misunderstand the scope of application of a particular regulation.
Interpretive method failures: EU law uses distinctive interpretive methods, including teleological interpretation (interpreting the law by its purpose), that differ from common law approaches. A model might choose the wrong method, or default to a frequency-based answer when the correct answer requires a specific interpretive approach.
Temporal and contested reasoning failures: the law evolves over time, and a model that cannot distinguish a current precedent from an overruled one will give dangerously wrong answers. It also might present contested legal questions as settled, or fail to recognize when a question is genuinely unsettled (the concern behind the acte clair doctrine in EU law, which turns on whether the correct interpretation leaves no reasonable doubt).
Coherence failures: the model might not integrate a legal provision with related instruments, or might cite a source that contradicts its own conclusion.
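For readers who think in code, here is a minimal sketch of how a taxonomy like this could be used to annotate model answers and tally failures per category. It is an illustration only, not Finck's method: the paper ships no dataset or scoring code, the label names below are hypothetical paraphrases of failure modes mentioned in this article, and the list is deliberately partial rather than the full 21.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five categories described in the article."""
    SOURCE_AND_AUTHORITY = "source recognition and authority"
    OPERATIVE_DOCTRINES = "operative doctrines"
    INTERPRETIVE_METHOD = "interpretive method"
    TEMPORAL_AND_CONTESTED = "temporal and contested reasoning"
    COHERENCE = "coherence"


# Hypothetical labels paraphrasing failure modes named in the article.
# The paper lists 21 modes and may word them differently; this is a subset.
FAILURE_MODES = {
    "norm_hierarchy_ignored": Category.SOURCE_AND_AUTHORITY,
    "autonomous_concept_missed": Category.SOURCE_AND_AUTHORITY,
    "soft_law_misweighted": Category.SOURCE_AND_AUTHORITY,
    "direct_effect_misapplied": Category.OPERATIVE_DOCTRINES,
    "scope_of_application_misread": Category.OPERATIVE_DOCTRINES,
    "wrong_interpretive_method": Category.INTERPRETIVE_METHOD,
    "frequency_based_default": Category.INTERPRETIVE_METHOD,
    "overruled_precedent_cited": Category.TEMPORAL_AND_CONTESTED,
    "contested_presented_as_settled": Category.TEMPORAL_AND_CONTESTED,
    "related_instrument_ignored": Category.COHERENCE,
    "self_contradicting_citation": Category.COHERENCE,
}


@dataclass
class Annotation:
    """A reviewer's verdict on one model answer (hypothetical format)."""
    answer_id: str
    failures: list[str]


def failures_by_category(annotations):
    """Count annotated failures per category, including zero counts."""
    counts = {category: 0 for category in Category}
    for annotation in annotations:
        for label in annotation.failures:
            # An unknown label raises KeyError, which surfaces typos early.
            counts[FAILURE_MODES[label]] += 1
    return counts


annotations = [
    Annotation("answer-1", ["overruled_precedent_cited", "soft_law_misweighted"]),
    Annotation("answer-2", ["contested_presented_as_settled"]),
    Annotation("answer-3", []),
]
counts = failures_by_category(annotations)
for category, n in counts.items():
    print(f"{category.value}: {n}")
```

The point of the sketch is the shape of the output: a profile of where reasoning breaks down, rather than the single accuracy figure that outcome-prediction benchmarks report.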
Why it matters now
The paper’s most urgent claim is legal, not technical. The EU AI Act, whose high-risk obligations take effect on August 2, 2026, requires that providers of high-risk AI systems meet an “appropriate accuracy” standard. For judicial AI systems, Finck argues, accuracy cannot mean anything other than doctrinal reasoning quality. Article 15 of the Act even specifies that the European Commission “shall encourage the development of benchmarks” for this purpose.
If no benchmark exists to measure doctrinal reasoning, there is no way for a provider to certify that a system meets the Act’s accuracy requirement, and no way for a regulator to verify the claim. The measurement gap is also a compliance gap.
The paper does not present a ready-made benchmark dataset or leaderboard. Finck explicitly notes that no models were empirically evaluated. Her claim is that the framework must be built first, and the 21-failure taxonomy is the foundation. She reports informal testing and offers the opinion that “current models cannot, yet, reason doctrinally,” but acknowledges that without proper measurement, that claim cannot be settled.
The bigger picture
There is an industrial policy dimension as well. If performance on EU law cannot be scored, AI investment flows to jurisdictions whose law can be. US law already has multiple benchmarks, however imperfect. The absence of an EU doctrinal reasoning benchmark is not just a regulatory gap. It is a competitive disadvantage for European legal AI companies trying to demonstrate that their products work.
Finck is an authoritative voice on this subject. She holds the CZS Endowed Chair of Law and AI at Tübingen, has served on the Council of Europe’s Committee on AI, and wrote the first sole-authored article-by-article commentary on the EU AI Act, published by Oxford University Press in 2026.
Her paper arrives just weeks before the AI Act’s high-risk provisions become enforceable. Whether the Commission will take up the challenge of building the benchmark she describes is an open question. But the clock is running.
Sources: arXiv:2606.18158 (June 16, 2026); Michèle Finck, University of Tübingen; EU AI Act, Articles 6(3), 15

