With AI wearing ever more hats in all kinds of workplaces, researchers are scrambling to devise tests that grade how well the technology actually performs in its myriad roles. In healthcare alone, several new benchmarks aim to gauge AI’s prowess in medical settings.
But none of the current tests look at how well AI can summarize real-world medical studies, a new report from startup Atropos Health claimed. The authors proposed a new framework to evaluate this skill on nine major models from Google, OpenAI, and Anthropic.
This kind of summarization is important because large linked databases of electronic health records (EHR) and automation tools have made real-world evidence more abundant and accessible, the authors wrote. As a company that uses AI to generate real-world evidence reports, Atropos has a vested interest here as well.
Report card: Google’s Gemini models were the clear winners among the nine models that researchers tested. But each model came with its own trade-offs.
All of the models provided summaries that “were generally judged to be complete.” Gemini 2.5 Pro performed best at correctly describing the direction of effect (i.e., whether outcomes were good or bad) and at getting the numbers right. Gemini 2.5 Flash and OpenAI’s o4-mini and o3 scored top marks on completeness. Gemini 2.0 Flash also won out on speed. Answers were all graded by a jury of their peers: three of the LLMs themselves.
The authors weighted each of these qualities into a final ranking in which Gemini 2.5 Pro (100%), Gemini 2.5 Flash (99%), and Gemini 2.0 Flash (94%) led the pack. OpenAI’s o4-mini and o3, as well as Claude 4 Sonnet, were in the middle, with some mixed scores on numerical accuracy, direction of effect, and speed. GPT-4.1, Claude 3.7 Sonnet, and Claude 4 Opus scored the worst, weighed down especially by a failure to reliably say whether an outcome was positive or negative.
The authors wrote that they gave this last ability the greatest weight in the final score, “reasoning that falsely reporting a protective drug effect as a harm (or vice versa) would be worse than inaccurate estimates or missing potentially important outcomes.”
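The mechanics of that kind of ranking are simple to picture. Here is a minimal Python sketch of a weighted, multi-judge score: the criterion names mirror those described in the report, but the specific weights, the 0-to-1 scale, and the averaging across three judges’ grades are illustrative assumptions, not Atropos’s published methodology or numbers.

```python
# Illustrative sketch only: criterion names follow the report, but the
# weights, scoring scale, and judge aggregation below are assumptions,
# not Atropos Health's actual methodology.
from statistics import mean

# Direction of effect gets the largest weight, reflecting the authors'
# reasoning that flipping a protective effect into a harm (or vice versa)
# is the worst possible error.
WEIGHTS = {
    "direction_of_effect": 0.40,
    "numerical_accuracy": 0.25,
    "completeness": 0.25,
    "speed": 0.10,
}

def final_score(judge_scores: dict[str, list[float]]) -> float:
    """Average each criterion across the LLM judges, then combine with weights."""
    return sum(WEIGHTS[c] * mean(scores) for c, scores in judge_scores.items())

# Example: one summary graded by three judges on a 0-1 scale per criterion.
print(round(final_score({
    "direction_of_effect": [1.0, 1.0, 1.0],
    "numerical_accuracy": [0.9, 0.8, 1.0],
    "completeness": [1.0, 0.9, 0.9],
    "speed": [0.6, 0.6, 0.6],
}), 3))
```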
A growing body: The authors proposed this new framework, RWESummary, as a possible addition to MedHELM, a more comprehensive medical AI benchmark set that Stanford University researchers released earlier this year.
While generative AI models can now ace standardized medical exams, researchers say that feat proves about as much about a model’s readiness for hospital use as a written driving test proves about a person’s readiness to get behind the wheel. Benchmarks like MedHELM, OpenAI’s new HealthBench, and others aim to capture a fuller picture of different models’ abilities.
With the breakneck pace of AI development, Atropos researchers said they don’t expect any one model to reign supreme for long, at least for now.
“The spaces of [real-world evidence] and LLMs are context dependent and evolving rapidly,” the authors wrote. “We do not believe there is likely to be a model to rule them all and for all time in the short to mid-term.”