OSGuard: New Benchmark Exposes the Safety Gap in Computer-Using AI Agents

Computer-use agents that can control desktops and browse the web are becoming a production category. OpenAI’s Operator, Anthropic’s Claude Computer Use, and Meta’s Manus Desktop are all competing for enterprise deployments. But a new benchmark from researchers at the University of California, Santa Cruz reveals a blind spot in how these systems are evaluated: passing a task does not mean the agent performed it safely.

OSGuard, introduced by researchers Mina Mohammadmirzaei and Jeffrey Flanigan, is a dual-granularity benchmark designed specifically to measure whether computer-use agents complete tasks without resorting to unsafe shortcuts. The paper was published on arXiv on June 13 (arXiv:2606.15034).

Current benchmarks for computer-use agents, such as OSWorld, evaluate agents almost exclusively on task completion. If an agent is told to “move the file”, it passes as long as the file ends up in the right location. But what if the agent overwrites a critical system file in the process? The task succeeds, but the operation was unsafe.

This distinction matters more as computer-use agents are deployed in real enterprise environments with access to sensitive files, credentials, and system settings. A 2026 survey by Aviso found that safety is weighted at 15 percent of enterprise AI agent evaluation criteria, with reliability at 30 percent and speed at 25 percent. But without benchmarks that explicitly test for unsafe behavior, safety remains an abstract concern rather than a measurable metric.

Dual-Granularity Design

OSGuard approaches the problem at two levels. The action-level benchmark evaluates individual guardrail decisions in isolation. It presents the model with contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged against the original instruction and the current interface state. This tests whether the agent can recognize a dangerous action in the moment.

The execution-level benchmark goes further. It constructs OSWorld-derived task variants where the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants. This allows the benchmark to distinguish between safe completions and unsafe completions that happen to satisfy the nominal goal.

What the Results Show

The experimental results reveal a telling gap. Current multimodal guardrails perform well on isolated action judgments at the action level. They can recognize when a proposed action is unsafe when presented as a standalone decision. But the risk-augmented execution suite exposes remaining gaps between local oversight and reliable end-to-end safety. An agent that correctly flags an unsafe action in isolation may still take that action when executing a multi-step task.

This finding has direct implications for how computer-use agents should be deployed. The benchmark suggests that relying solely on per-step guardrails is insufficient. Systems need safety monitoring that spans entire task trajectories, not just individual actions.

Why It Matters Now

Three major computer-use agent platforms are competing for enterprise adoption this year. OpenAI’s GPT-5.4 asserts a self-reported score of 75.0 percent on OSWorld-Verified, the first claimed breach of the 72.4 percent human baseline. Anthropic’s Claude Opus 4.6 leads independently verified OSWorld rankings at 72.7 percent. Meta’s Manus Desktop takes a different approach by running as a local agent rather than a browser-sandboxed cloud service.

All three face the same fundamental question OSGuard raises: how many of those task completions were safe? The benchmark provides a framework for answering that question, and the early results suggest the industry has work to do.

Sources: arXiv:2606.15034 (June 13); AgentMarketCap (April 11); BenchLM (June 2026); Aviso (November 2025)

Dual-Granularity Design

What the Results Show

Why It Matters Now

Leave a Comment Cancel Reply