OpenAI’s LifeSciBench: A 750-Task Benchmark Reveals How Far AI Still Is From Replacing Life Scientists

OpenAI has released LifeSciBench, a benchmark of 750 expert-authored tasks designed to measure whether AI models can actually perform real life-science research. The results show that even the best models pass barely more than a third of the tasks.

The benchmark, announced on June 17 and developed by 173 PhD-level scientists, tests models across seven biological domains and seven workflow categories, from evidence handling and experimental design to translational science and clinical reasoning. Each task is a free-response prompt grounded in real research scenarios, graded against granular expert-written rubrics averaging 25 criteria per task.

The key finding: no current AI model is close to replacing a professional life scientist.

LifeSciBench departs from conventional AI benchmarks in several ways. Instead of multiple-choice questions with clean answers, it presents open-ended research scenarios. A task might ask a model to interpret an incomplete Western blot, critique a surrogate-endpoint dossier for an FDA meeting, or design a manufacturing process for a gene therapy vector.

Fifty-three percent of tasks require interpreting at least one of 1,062 attached artifacts: figures, PDFs, tables, chemical structures, sequence files, and web references. Seventy-nine percent require multiple reasoning steps, averaging four per task. The 19,020 rubric criteria across all tasks award partial credit for specific facts, reasoning steps, and numeric answers within tolerance. A passing score requires 70% of total points.
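The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not OpenAI's grader: the `Criterion` class, the per-criterion point weights, and the 5% numeric tolerance are hypothetical, since the article does not say how criteria are weighted or what tolerances apply. Only the 70% pass threshold comes from the source.

```python
import math
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # from the article: a pass requires 70% of total points


@dataclass
class Criterion:
    """One rubric line. The fields are hypothetical, chosen for illustration."""
    description: str
    points: float  # assumed weight; the article does not describe weighting
    met: bool      # whether the grader judged the criterion satisfied


def numeric_within_tolerance(answer: float, expected: float, rel_tol: float = 0.05) -> bool:
    """Check a numeric answer against an expected value (the 5% tolerance is an assumption)."""
    return math.isclose(answer, expected, rel_tol=rel_tol)


def score_task(criteria: list[Criterion]) -> tuple[float, bool]:
    """Return (normalized score in [0, 1], passed) for one task."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(c.points for c in criteria if c.met)
    normalized = earned / total
    return normalized, normalized >= PASS_THRESHOLD


# Usage: a toy three-criterion rubric where the numeric answer is slightly off.
rubric = [
    Criterion("Identifies the missing loading control", 3, True),
    Criterion("Proposes a follow-up experiment", 2, False),
    Criterion("Reports fold change within tolerance", 5,
              numeric_within_tolerance(10.4, 10.0)),
]
score, passed = score_task(rubric)
print(f"normalized score = {score:.2f}, passed = {passed}")
```

Under this reading, a response can earn substantial partial credit and still fail the task, which is why the table of results reports normalized scores and pass rates separately.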

The construction rigor is unusually high. Tasks went through an average of six self-directed review cycles and at least two rounds of expert review. A 453-person validation panel (97% PhDs, averaging 12 years of experience and 14 publications) reviewed the tasks, with overall inter-rater agreement exceeding 96%.

The Results

OpenAI tested five models in a single-turn setting with internet browsing permitted:

| Model | Normalized Score | Task Pass Rate (70%+) |
|---|---|---|
| GPT-Rosalind (OpenAI, domain-specialized) | 0.576 | 36.1% |
| GPT-5.5 (OpenAI, general frontier) | 0.519 | 25.7% |
| Gemini 3.1 Pro (Google) | 0.515 | 23.6% |
| GPT-5.4 (OpenAI) | 0.479 | 20.7% |
| Grok 4.3 (xAI) | 0.399 | 13.0% |

GPT-Rosalind, OpenAI’s domain-specialized life sciences model, led across the board but still passed only 36.1% of tasks. For 171 tasks (22.8%), no model managed a passing score at all. For 261 tasks (34.8%), even the best model scored below 20%.

GPT-Rosalind performed best on translation (bench-to-bedside reasoning, 0.712 mean score) and scientific communication (0.718), and weakest on design, optimization and prediction (30.7% pass rate) and analysis (30.3%).

Artifacts remain a major bottleneck: GPT-Rosalind’s text-only pass rate was 45.1%, which dropped to 28.1% when artifacts were included. GPT-5.5’s dropped from 29.9% to 21.9%.
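A quick arithmetic check shows how these split figures relate to the overall pass rates. The sketch below assumes that the 28.1% and 21.9% figures are pass rates on the artifact-bearing tasks alone, and that exactly 53% of tasks carry artifacts (the article's rounded share). Both are interpretations rather than statements from the benchmark authors, and the function names are illustrative.

```python
ARTIFACT_SHARE = 0.53  # article: 53% of tasks require at least one artifact (rounded)


def blended_pass_rate(text_only: float, with_artifacts: float,
                      artifact_share: float = ARTIFACT_SHARE) -> float:
    """Weighted average of the two subgroup pass rates, in percent."""
    return (1 - artifact_share) * text_only + artifact_share * with_artifacts


def artifact_drop(text_only: float, with_artifacts: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = text_only - with_artifacts
    return absolute, absolute / text_only


# (text-only %, artifact %, reported overall %) as quoted in the article
models = {
    "GPT-Rosalind": (45.1, 28.1, 36.1),
    "GPT-5.5": (29.9, 21.9, 25.7),
}

for name, (text_only, with_artifacts, reported) in models.items():
    blended = blended_pass_rate(text_only, with_artifacts)
    absolute, relative = artifact_drop(text_only, with_artifacts)
    print(f"{name}: blended {blended:.1f}% vs reported {reported}%, "
          f"drop {absolute:.1f} points ({relative:.0%} relative)")
```

Under those assumptions, the blended rates land within a tenth of a point of the reported overall pass rates for both models, which suggests the split figures and the headline numbers are consistent with each other.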

Notably, Anthropic’s Claude models were not included in the evaluation.

What It Means

The benchmark was explicitly designed to show headroom. OpenAI’s announcement framed it as a tool for measuring progress, not declaring victory: “Agentic AI systems are becoming increasingly capable of performing scientific tasks. However, their usefulness to life science researchers depends on how well they handle the complexity of real research.”

The results confirm that domain-specialized models like GPT-Rosalind provide a measurable but not transformative advantage over general frontier models. The gap between GPT-Rosalind and GPT-5.5 was roughly 10 percentage points in pass rate. Against the broader challenge of 750 real-world research tasks, that advantage is meaningful but incremental.

OpenAI plans to release portions of LifeSciBench to independent third-party leaderboards for all frontier labs to use. For now, the benchmark serves as a reality check for claims about AI replacing human scientists: the best models still fail nearly two-thirds of expert-level research tasks. The headroom is by design, but it is also substantial.

Sources: MarkTechPost AI (June 17, 2026); OpenAI blog (June 17, 2026); R&D World Q&A with OpenAI leads (June 17, 2026)
