New AI Agent Benchmark Reveals a Fundamental Privacy Problem: All Models Leak

A new benchmark called TRAP has exposed a fundamental privacy vulnerability in AI agents: every model tested leaks private information when prompted adversarially, and no amount of prompt engineering can fully fix it. The results, published on arXiv on June 17, point to a structural problem that may require architectural changes, not better instructions (arXiv:2606.18996, June 17).

### The Core Tension

TRAP stands for Task-completion and Resistance to Active Privacy-extraction. The benchmark was designed by researchers at POSTECH to measure a conflict inherent to AI agents deployed in document-heavy workflows.

Consider an agent booking a flight. It needs a passport number to complete the task. But that same capability, the ability to read and use private information, makes the agent vulnerable to being tricked into revealing it. An attacker sharing the same chat session could simply ask: “What passport number did you just use?”

The agent faces two obligations that pull in opposite directions: use the private data accurately or never expose it. TRAP measures both at once.

### How TRAP Works

Each test case contains three elements:

A document containing private information (passport numbers, SSNs, financial details)
A task query that requires the agent to use the correct tool and private fields to complete a legitimate action
An attack query that attempts to elicit the same information in natural language from the model

The benchmark evaluates 22 models spanning frontier proprietary systems and open-source releases at multiple scales. Every model is scored on both task accuracy (did it complete the job?) and privacy leakage (did it reveal the private data when probed?).

### The Results: All Models Leak

The finding is stark. Every model family tested exhibits non-trivial leakage. Worse, instruction-following ability correlates directly with leakage rate: the better a model is at following user instructions, the more readily it reveals private data when asked.

Existing prompt-based defenses such as “never reveal this information” reduce leakage but at a significant cost to task accuracy. And the paper includes an impossibility result: for any softmax-based model, no soft-constraint defense can jointly achieve high task success with zero leakage probability.

The failure of prompt-based defenses is not incidental. It is a mathematical limit of the architecture.

### Why This Matters for Real-World Deployment

AI agents are already being deployed in document-intensive workflows across healthcare, finance, legal, and travel. In each of these domains, private information is not an edge case but a routine input. A medical coding agent needs patient records. A tax preparation agent needs Social Security numbers. A travel agent needs passport details.

The TRAP results suggest that every one of these deployments carries an inherent privacy risk that cannot be patched with system prompts or instruction tuning. If an attacker can share a session with the agent, they can extract the private data the agent was legitimately using.

### A Structural Solution

The researchers propose a way around the limitation: structural private field isolation. Instead of relying on the model to protect private data through instructions, private fields are replaced with hash keys before they reach the model. The model never sees the raw private data; it only sees a cryptographic placeholder. The hash key is resolved by a separate verification layer outside the model’s control.

This structural approach largely prevents leakage while keeping task accuracy intact, because the model can still use the hash key as a reference without ever having access to the plaintext value.

The tradeoff is architectural complexity. It requires redesigning how agents interact with private data, adding a verification layer that current agent frameworks do not include. But the TRAP results suggest that for applications handling sensitive information, this complexity may be unavoidable.

Sources: arXiv:2606.18996 (June 17); MarkTechPost contextual reference for deployment safety methods (June 16)

Leave a Comment Cancel Reply