Benchmarks in medicine: the promise and pitfalls of evaluating AI tools with mismatched yardsticks

In May 2025, OpenAI released HealthBench, a new benchmarking system to test the clinical capabilities of large language models (LLMs) such as ChatGPT. On the surface, this may sound like yet another technical update. But for the medical world, it marked an important moment: a quiet acknowledgement that our current ways of evaluating medical AI are fundamentally flawed.
Headlines in the recent past have trumpeted that AI "outperforms doctors" or "aces medical exams." The impression that comes through is that these models are smarter, faster, and perhaps even safer. But the hype masks a deeper truth. To put it plainly, the benchmarks behind these claims are built on exams designed to test human memory retention from classroom teaching. They reward fact recall, not clinical judgment.
A calculator problem
A calculator can multiply two six-digit numbers within seconds. Impressive, no doubt. But does this mean calculators are better than, and understand maths more deeply than, mathematics experts? Or better even than an ordinary person who takes a few minutes to do the calculation with pen and paper?
Language models are celebrated because they can churn out textbook-style answers to MCQs and fill in the blanks for medical facts faster than medical professors. But the practice of medicine is not a quiz. Real doctors deal with ambiguity, emotion, and decision-making under uncertainty. They listen, observe, and adapt.
The irony is that while AI beats doctors at answering questions, it still struggles to generate the very case vignettes that form the basis of those questions. Writing a good clinical scenario from real patients in clinical practice requires understanding human suffering, filtering out irrelevant details, and framing the diagnostic dilemma with context. So far, that remains a deeply human ability.
What current benchmarks miss
Most widely used benchmarks, such as MedQA, PubMedQA, and MultiMedQA, pose structured questions with one "correct" answer or fill-in-the-blank questions. They evaluate factual accuracy but overlook human nuance. A patient does not say, "I have been using a faulty chair and sitting in the wrong posture for long hours and have a non-specific backache ever since I bought it. So please choose the best diagnosis and give appropriate treatment." They simply say, "Doctor, I am tired. I don't feel like myself." That is where the real work begins.
Clinical environments are messy. Doctors deal with overlapping illnesses, vague symptoms, incomplete notes, and patients who may be unable, or unwilling, to tell the full story. Communication gaps, emotional distress, and even socio-cultural factors influence how care unfolds. And yet, our evaluation metrics continue to look for precision, clarity, and correctness: things the real world rarely offers.
Benchmarking vs reality
It may be easy to decide who the best batter in the world is by simply counting runs scored. Similarly, bowlers can be ranked by the number of wickets taken. But answering the question "Who is the best fielder?" is not as straightforward. Fielding is highly subjective and evades easy numbers. The number of run-outs assisted or catches taken tells only part of the story. The effort spent on the boundary line to save runs, or the sheer intimidating presence of fielders like Jonty Rhodes or R. Jadeja stopping runs at covers or point, cannot be measured easily.
Healthcare is like fielding: it is qualitative, often invisible, deeply contextual, and hard to quantify. Any benchmark that pretends otherwise will mislead more than it illuminates.
This is not a new problem. In 1946, the civil servant Sir Joseph Bhore, when consulted on reforming India's healthcare, said, "If it were possible to evaluate the loss, which this country annually suffers through the avoidable waste of valuable human material and the lowering of human efficiency through malnutrition and preventable morbidity, we feel that the result would be so startling that the whole country would be aroused and would not rest until a radical change had been brought about." The quote reflects a longstanding dilemma: how to measure what truly matters in health systems. Even after 80 years, we have not found perfect evaluation metrics.

What HealthBench does
HealthBench at least acknowledges this disconnect. Developed by OpenAI in collaboration with clinicians, it moves away from conventional multiple-choice formats. It is also the first benchmark to explicitly score responses against 48,562 unique rubric criteria, with point values ranging from minus 10 to plus 10, reflecting some of the real-world stakes of clinical decision-making. A dangerously wrong answer is punished more harshly than a mildly unhelpful one. This, finally, mirrors medicine's moral landscape.
Even so, HealthBench has limitations. It evaluates performance across just 5,000 "simulated" clinical cases, of which only 1,000 are classified as difficult. That is a vanishingly small slice of clinical complexity. Though commendably global, its doctor-rater pool consists of just 262 physicians from 60 countries working across 52 languages, with varying professional experience and cultural backgrounds (three physicians from India participated, and simulations in 11 Indian languages were generated). HealthBench Hard, a challenging subset of 1,000 cases, revealed that many existing models scored zero, highlighting their inability to handle complex clinical reasoning. Moreover, these cases are still simulations. The benchmark is an improvement, not a revolution.
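To make the scoring idea concrete, here is a minimal Python sketch of how rubric-based grading of this kind can work. The names (RubricCriterion, rubric_score), the point values, and the normalisation are illustrative assumptions for this article, not OpenAI's published implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "recommends urgent evaluation for red-flag symptoms"
    points: int        # roughly -10 to +10; negative points flag harmful content

def rubric_score(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score one model response against a per-case rubric.

    Points are earned only for criteria the response satisfies; negative
    criteria subtract points when the harmful behaviour is present. The total
    is normalised by the maximum achievable positive score and clipped to [0, 1].
    """
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    max_positive = sum(c.points for c in criteria if c.points > 0)
    if max_positive == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_positive))

# Hypothetical case: one dangerous suggestion wipes out most of the credit
# earned by otherwise reasonable advice.
criteria = [
    RubricCriterion("identifies red-flag symptoms", 7),
    RubricCriterion("recommends urgent evaluation", 8),
    RubricCriterion("suggests an unsafe home remedy", -10),
]
print(rubric_score(criteria, met=[True, True, True]))   # ~0.33
print(rubric_score(criteria, met=[True, True, False]))  # 1.0
```

The point of such a design is the asymmetry: a single harmful statement can erase most of the credit a response earns elsewhere, which is closer to how clinical risk actually behaves.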
Predictive AI's collapse in the real world
This is not just about LLMs. Predictive models have faced similar failures. The sepsis prediction tool developed by Epic to flag early signs of sepsis showed initial promise several years ago. However, once deployed, it could not meaningfully improve outcomes. Another company, which claimed to have developed an algorithm for identifying liver transplantation recipients, folded quietly after its model showed bias against younger patients in Britain. It failed in the real world despite glowing performances on benchmark datasets. Why? Because predicting rare or critical events requires context-aware decision-making. A seemingly unknown determinant can lead to wrong predictions and unnecessary ICU admissions. The cost of error is high, and people often bear it.
What makes a good benchmark?
A robust medical benchmark should meet four criteria:
Represent reality: include incomplete information, contradictory symptoms, and noisy environments.
Test communication: measure how well a model explains its reasoning, not just what answer it gives.
Handle edge cases: evaluate performance on rare, ethically complex, or emotionally charged scenarios.
Reward safety over certainty: penalise overconfident wrong answers more than humble uncertainty, as sketched below.
Currently, most benchmarks miss these criteria. And without these elements, we risk trusting technically brilliant but clinically naïve models.
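To illustrate the fourth criterion, rewarding safety over certainty, consider a toy scoring rule in Python. The point values and the two hypothetical models are assumptions made purely for illustration; no real benchmark uses these exact numbers.

```python
def safety_aware_points(correct: bool, abstained: bool) -> int:
    """Toy asymmetric scoring: an overconfident wrong answer costs more than
    an honest 'I am not sure'. Point values are illustrative only."""
    if abstained:
        return 0                     # admitting uncertainty: no credit, no harm
    return 2 if correct else -5      # confident answers: reward right, punish wrong

# Hypothetical Model A answers all 10 questions confidently and gets 8 right.
model_a = sum(safety_aware_points(c, abstained=False) for c in [True] * 8 + [False] * 2)

# Hypothetical Model B answers 8 confidently (all right) and abstains on the 2 it is unsure of.
model_b = sum(safety_aware_points(True, abstained=False) for _ in range(8)) + \
          sum(safety_aware_points(True, abstained=True) for _ in range(2))

print(model_a, model_b)  # 6 16 -- the model that admits doubt scores higher
```

Under such a rule, a model that flags its own uncertainty outscores one that never admits doubt, even though the second "answers more questions".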
Red teaming the models
One way forward is red teaming, a method borrowed from cybersecurity in which systems are tested against ambiguous, edge-case, or morally complex scenarios. For example: a patient in psychological distress whose symptoms may be somatic; an undocumented immigrant fearful of revealing travel history; a child with vague neurological symptoms and an anxious parent pushing for a CT scan; a pregnant woman with religious objections to blood transfusion; a terminal cancer patient unsure whether to pursue aggressive treatment or palliative care; a patient feigning illness for personal gain.
In these edge cases, models must go beyond knowledge. They must demonstrate judgment, or, at the very least, know when they don't know. Red teaming does not replace benchmarks. But it adds a deeper layer, exposing overconfidence, unsafe logic, or lack of cultural sensitivity. In real-world medicine, these flaws matter more than ticking the right answer box. Red teaming forces models to reveal what they know and how they think. It uncovers aspects that may remain hidden in benchmark scores.

Why this matters
The core tension is this: medicine is not just about getting answers right. It is about getting people right. Doctors are trained to deal with doubt, handle exceptions, and recognise cultural patterns not taught in books (doctors also miss a lot). AI, in contrast, is only as good as the data it has seen and the questions it has been trained on. HealthBench, for all its flaws, is a small but significant course correction. It recognises that evaluation needs to change. It introduces a better scoring rubric. It asks harder questions. That makes it better. But we must remain cautious. Healthcare is not like image recognition or language translation. A single incorrect model output can mean a lost life and a ripple effect: misdiagnoses, lawsuits, data breaches, even health crises. In the age of data poisoning and model hallucination, the stakes are existential.
The road ahead
We must stop asking whether AI is better than doctors. That is not the right question. Instead, we should ask: where is AI safe, useful, and ethical to deploy, and where is it not? Benchmarks, if thoughtfully redesigned, can help answer that. AI in healthcare is not a contest to win. It is a responsibility to share. We must stop treating model performance as a leaderboard game and start thinking of it as a safety checklist. Until then, AI can assist. It can summarise. It can remind. But it cannot replace the moral and emotional weight of clinical judgment. It certainly cannot sit beside a dying patient and know when to speak and when to stay silent.
(Dr. C. Aravinda is an academic and public health physician. The views expressed are personal. aravindaaiimsjr10@hotmail.com)
Published – June 12, 2025 07:30 am IST