750 tasks that aren't "guess the right letter"
On June 17, 2026, OpenAI officially released LifeSciBench, a new benchmark for evaluating how well AI models handle real-world research tasks in the life sciences. It is not another set of multiple-choice questions from a biochemistry textbook. Each of the 750 tasks is posed as an open-ended request that calls for a free-text answer, much as a scientist would hand a task to a colleague, and can come with attached data files, graphs, tables, and PDF documents.
The benchmark was created with the involvement of 173 expert scientists with PhDs, all with experience in the biotech or pharmaceutical industry. Each task underwent at least two rounds of expert review and an average of six automated revision cycles before acceptance. Quality was then independently verified by another 453 reviewers, 97% of whom held doctorates — and agreement on relevance, scientific accuracy, and usefulness of tasks exceeded 96% across all categories.
How evaluation works: rubrics, not a single answer key
The key innovation of LifeSciBench is the rubric-based evaluation system. Instead of comparing against a single "correct" answer, each model response is evaluated against a detailed set of criteria — across the entire benchmark there are 19,020 of them, roughly 25 per task. Each criterion rewards a specific fact, computational step, scientific justification, or correctly identified limitation of a method.
Thanks to this, the benchmark reports two distinct metrics: the normalized score (points earned divided by the maximum possible, which allows partial credit) and the task success rate (the proportion of tasks on which the model earned at least 70% of the available points). The distinction matters: an answer can be partially correct and useful without completing the task as a whole, which is how real scientific work is judged too.
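The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's scoring code: the `Criterion` class, the point weights, and the choice to average per-task scores across the benchmark are assumptions made for the example. Only the 70% pass threshold comes from the description above.

```python
from dataclasses import dataclass

# From the article: a task counts as passed at >= 70% of available points.
PASS_THRESHOLD = 0.70


@dataclass
class Criterion:
    """One rubric criterion (hypothetical structure, for illustration only)."""
    max_points: float
    earned_points: float


def normalized_score(rubric: list[Criterion]) -> float:
    """Points earned divided by the maximum possible for one task."""
    max_total = sum(c.max_points for c in rubric)
    if max_total == 0:
        return 0.0
    return sum(c.earned_points for c in rubric) / max_total


def benchmark_metrics(tasks: list[list[Criterion]]) -> tuple[float, float]:
    """Return (mean normalized score, task success rate) over all tasks.

    Assumption: the benchmark-level score is the mean of per-task scores.
    """
    scores = [normalized_score(rubric) for rubric in tasks]
    mean_score = sum(scores) / len(scores)
    success_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
    return mean_score, success_rate


# Two made-up tasks: the first earns 3 of 4 points, the second 2 of 4.
task_a = [Criterion(2, 2), Criterion(1, 1), Criterion(1, 0)]
task_b = [Criterion(2, 1), Criterion(2, 1)]
mean_score, success_rate = benchmark_metrics([task_a, task_b])
print(mean_score, success_rate)  # 0.625 0.5
```

In this toy example the second task still contributes half credit to the normalized score even though it does not count as a success, which is one reason a model's normalized score can sit well above its success rate.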
Seven workflows, seven scientific domains
The tasks cover seven key research workflows: evidence work, analysis, design and optimization, scientific reasoning, validation and laboratory operations, translation (transferring findings into practice), and scientific communication. Thematically they range from genomics through medicinal chemistry to clinical and translational science.
Approximately 79% of tasks require multiple reasoning or decision-making steps (an average of 4 steps per task). Moreover, over half of the tasks (53%) require working with attached data artifacts — DNA sequences, chemical structures, graphs, tables, or PDF articles. In total, 1,062 artifacts are attached to the tasks.
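To put these percentages into task counts, here is a rough back-of-the-envelope sketch. The inputs are the figures quoted in this article; the derived values are estimates only, because the published percentages are rounded, and the helper name `estimated_count` is made up for the example.

```python
TOTAL_TASKS = 750        # from the article
TOTAL_CRITERIA = 19_020  # from the article
TOTAL_ARTIFACTS = 1_062  # from the article


def estimated_count(percent: float, total: int = TOTAL_TASKS) -> float:
    """Convert a rounded percentage back into an approximate task count."""
    return percent * total / 100


multi_step_tasks = estimated_count(79)  # tasks needing several steps
artifact_tasks = estimated_count(53)    # tasks with attached artifacts

# Average rubric size, and artifacts per artifact-bearing task (estimate).
criteria_per_task = TOTAL_CRITERIA / TOTAL_TASKS
artifacts_per_task = TOTAL_ARTIFACTS / artifact_tasks

print(multi_step_tasks, artifact_tasks)   # 592.5 397.5
print(round(criteria_per_task, 2))        # 25.36
print(round(artifacts_per_task, 1))       # 2.7
```

So roughly 590 tasks are multi-step, close to 400 carry artifacts, and a task that has artifacts carries between two and three of them on average.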
How top models performed: reality vs. expectations
OpenAI tested five models in single-turn mode (one answer per prompt, with internet access allowed). The results are clear:
| Model | Normalized score | Task success rate |
|---|---|---|
| GPT-Rosalind | 57.6% | 36.1% |
| GPT-5.5 | 51.9% | 25.7% |
| Gemini 3.1 Pro | 51.5% | 23.6% |
| GPT-5.4 | 47.9% | 20.7% |
| Grok 4.3 | 39.9% | 13.0% |
GPT-Rosalind, OpenAI's domain-specialized model, achieved the best results on both metrics, yet it still succeeded on only 36.1% of tasks. Interestingly, Google's Gemini 3.1 Pro was the sole top scorer on 214 of the 750 tasks, which shows that even a model ranked third overall can have specific strengths. Grok 4.3 from xAI lagged well behind, with a success rate of just 13.0%.
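One plausible reading of a model leading "uniquely" on a task is that it holds the strictly highest score there, with ties credited to no one. The sketch below counts such wins under that assumption. The tie rule, the function name `count_sole_leaders`, the placeholder model names, and the scores are all invented for illustration and are not taken from the benchmark.

```python
from collections import Counter


def count_sole_leaders(per_task_scores: list[dict[str, float]]) -> Counter:
    """Count, per model, the tasks where it alone has the top score."""
    wins: Counter = Counter()
    for scores in per_task_scores:
        best = max(scores.values())
        leaders = [model for model, score in scores.items() if score == best]
        if len(leaders) == 1:  # assumption: ties are credited to no one
            wins[leaders[0]] += 1
    return wins


# Made-up per-task scores for three placeholder models.
example = [
    {"model_a": 0.80, "model_b": 0.60, "model_c": 0.40},
    {"model_a": 0.50, "model_b": 0.70, "model_c": 0.30},
    {"model_a": 0.90, "model_b": 0.90, "model_c": 0.20},  # tie, no winner
    {"model_a": 0.10, "model_b": 0.65, "model_c": 0.55},
]
wins = count_sole_leaders(example)
print(wins["model_a"], wins["model_b"], wins["model_c"])  # 1 2 0
```

A count like this is independent of the overall ranking: a model can trail on average and still be the only one to do best on a sizeable slice of tasks.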
For context: 171 tasks (22.8% of the benchmark) were not passed by any of the tested models, and on 261 tasks (34.8%) even the best model had a success rate below 20%. The benchmark is far from saturated, which is refreshing at a time when common language benchmarks are falling like dominoes.
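The shares quoted here follow directly from the task counts, as a quick check shows. The constants are the figures from this article; the helper `percent_of_benchmark` is a name chosen for the example.

```python
TOTAL_TASKS = 750      # from the article
UNSOLVED_BY_ALL = 171  # tasks no tested model passed
HARD_FOR_BEST = 261    # tasks where even the best model stayed below 20%


def percent_of_benchmark(count: int, total: int = TOTAL_TASKS) -> float:
    """Share of the benchmark, in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(percent_of_benchmark(UNSOLVED_BY_ALL))  # 22.8
print(percent_of_benchmark(HARD_FOR_BEST))    # 34.8

# The complement: tasks that at least one tested model did pass.
solved_by_at_least_one = TOTAL_TASKS - UNSOLVED_BY_ALL
print(solved_by_at_least_one)  # 579
```

Put the other way round, 579 of the 750 tasks were passed by at least one of the five models, even though no single model passed more than about a third of them.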
Where AI excels — and where it fails
The models are strongest on tasks requiring structured scientific communication and synthesis of findings. GPT-Rosalind achieved 71.1% success in scientific communication (this category contains only 9 tasks, so the figure should be treated with caution) and 57.7% in translation, that is, carrying preclinical findings over into clinical implications.
Conversely, design, optimization, and prediction remain extremely difficult for AI — GPT-Rosalind passed only 30.7% of tasks here. The drop is even more pronounced when working with data artifacts: GPT-Rosalind's success rate falls from 45.1% on purely text-based tasks to 28.1% on tasks with artifacts. Models have a clear problem extracting relevant information from graphs, sequence files, or complex tables and integrating it into the final answer.
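The size of the artifact penalty can be stated two ways, in percentage points or relative to the text-only baseline, and the two are easy to confuse. A small sketch using the two success rates quoted above (the function name `drop` is an illustrative choice):

```python
def drop(before: float, after: float) -> tuple[float, float]:
    """Return the drop in percentage points and as a relative percentage."""
    points = before - after
    relative = 100 * points / before
    return round(points, 1), round(relative, 1)


# GPT-Rosalind: text-only tasks vs. tasks with attached artifacts.
print(drop(45.1, 28.1))  # (17.0, 37.7)
```

In other words, the success rate falls by 17 percentage points, which means the model loses more than a third of its text-only success rate once artifacts are involved.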
Tasks requiring precise outputs fare particularly poorly: numerical values (14.8% success rate), sequences or structures (24.0%), and construct designs (27.3%). This shows that while current AI models can reason skillfully about scientific concepts, they often fail to deliver an exact, laboratory-usable result.
What this means for the pharmaceutical industry — and for Czechia
LifeSciBench arrives at a time when pharmaceutical companies worldwide are investing heavily in AI, from drug discovery to clinical trial optimization. But the benchmark results clearly show that AI is not yet ready to replace an experienced researcher. It can be a useful assistant for literature searches, spotting patterns in data, or proposing hypotheses, but final decisions on experimental design and interpretation of results remain in human hands.
For the Czech and European context, this is doubly important. Europe has a strong pharmaceutical base (from Novartis to local biotech startups like SOTIO or Apigenex), and the European Union is among the strictest regulatory environments in the world. A benchmark that evaluates whether AI understands regulatory requirements, clinical trial design, and interpretation of preclinical data is therefore exceptionally relevant for European companies. If you want to use AI in the drug approval process, you need confidence that the model correctly understands the context and limits of the data, and that is exactly what LifeSciBench tests.
Benchmark limitations: why it is too early to celebrate
It's fair to mention the limits too. LifeSciBench tests models only in single-turn mode — but real research is iterative: a scientist formulates a hypothesis, obtains data, adjusts the hypothesis, proposes another experiment. Furthermore, it's a benchmark created by OpenAI, which may raise questions about impartiality (though independent validation by 453 reviewers largely compensates for this). And finally — 750 tasks cannot cover all scientific specializations.
OpenAI itself states that strong performance on LifeSciBench should be interpreted as "evidence of realistic task-solving ability," not as a direct measure of research impact. The next step will be connecting benchmark results with studies in real research environments.
The full research paper and technical details are available on the official OpenAI blog and in the preprint (PDF). A detailed comparison was also published by MarkTechPost.
Is LifeSciBench publicly available to researchers outside OpenAI?
OpenAI states that public release may be restricted by security and licensing conditions. Specific availability details have not yet been announced, but researchers can register as contributors or request access to the domain-specialized GPT-Rosalind model through forms on the OpenAI website.
How does LifeSciBench differ from better-known benchmarks like MMLU or GPQA?
Fundamentally. While MMLU and GPQA rely mostly on multiple-choice questions with a single correct answer, LifeSciBench works with free-text responses, evaluates them against rubrics averaging about 25 criteria per task, and requires working with real data artifacts (sequences, graphs, PDFs) in roughly half of its tasks. It tests not only knowledge but also scientific reasoning, handling of uncertainty, and communication of results.
Can LifeSciBench help with developing new drugs?
Not directly — it's an evaluation tool, not a research platform. But indirectly yes: it shows pharmaceutical companies where current AI models are truly useful (literature searches, synthesis of findings, communication) and where they still fall short (experimental design, precise calculations, working with complex data). This will help companies better decide which parts of the research process to involve AI in, and where to keep human control.