An artificial intelligence reasoning model developed by OpenAI outperformed experienced emergency room physicians across a sweeping set of clinical tests — including real-world triage cases — according to a study published April 30 in the journal Science. The findings, led by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center in Boston, represent one of the most rigorous head-to-head comparisons ever conducted between AI and practicing doctors.
The model, known as OpenAI o1-preview, was evaluated across six categories of clinical reasoning tasks. In each test, the AI matched or exceeded a baseline drawn from hundreds of board-certified, actively practicing physicians. At the early stages of emergency department triage, when physicians have the least patient information available, the model correctly identified the diagnosis in roughly 67 percent of real-world cases, compared with 50 to 55 percent accuracy for experienced attending physicians. Evaluators who reviewed the assessments did not know whether a given triage decision came from the AI model or from one of two expert physicians.
The study covered 76 live emergency department cases drawn from electronic health records at a Boston hospital. The model worked entirely from that raw text data — the same messy, fragmented records available to human clinicians at each decision point. Researchers also tested the AI against complex case reports published in The New England Journal of Medicine since 1959, a set historically considered the most demanding benchmark in clinical decision support. The AI cleared that benchmark as well, outpacing both prior AI generations and human physicians on what the study called management reasoning — choosing next steps in care, including antibiotic use and end-of-life planning.
Co-senior author Arjun Manrai, an assistant professor of biomedical informatics at Harvard Medical School and deputy editor of NEJM AI, stressed that the results do not mean AI is ready to replace physicians. Rather, the team called for controlled clinical trials to determine exactly how and where the technology can improve outcomes alongside human caregivers. Co-first author Peter Brodeur, a clinical fellow at Harvard Medical School, noted that management reasoning requires weighing not just objective data but a patient’s context, circumstances, and goals, the dimensions where AI showed its most dramatic improvement over both prior models and unaided humans.
For healthcare systems in the Upstate, the study lands at a moment of active investment in precision medicine technology. Spartanburg Regional Healthcare System, the region’s largest hospital network, has deployed the 2bPrecise Enterprise platform across its six hospitals and more than 150 clinics — a system that integrates AI-assisted genomic decision support for its 700 physicians. Spartanburg Medical Center was also the first Upstate South Carolina facility to deploy the HYDROS Aquablation Therapy device, which uses AI-powered treatment planning for prostate procedures. Prisma Health, which serves patients across the Upstate and Midlands, has established a partnership with Siemens Healthineers to deploy AI-assisted diagnostic algorithms and clinical decision support across its 21-county service area in South Carolina.
Study authors and outside commentators cautioned that the AI model worked only from text data — it had no access to imaging, physical examination findings, or the nonverbal cues clinicians rely on in practice. Those limitations mean the gap between AI performance on reasoning benchmarks and AI performance at the bedside remains real. Still, Manrai described the moment as a profound shift: the tools are now improving fast enough that the field must determine — through rigorous prospective trials — where AI assistance genuinely helps patients and where human judgment remains irreplaceable.