DeepSeek-R1 Shows Promise—and Limits—in Medical AI Benchmarking

Type: research
Area: AI, Medical
Published (YearMonth): 2504
Source: https://www.nature.com/articles/s41591-025-03726-3
Tag: newsletter
Checkbox:
Date (of entry):

In a Nature Medicine brief communication, researchers evaluated the clinical performance of the DeepSeek-R1 large language model across a range of medical tasks, comparing it with ChatGPT-o1 and LLaMA 3.1-405B. DeepSeek-R1 scored 92% accuracy on the USMLE, trailing ChatGPT-o1 slightly (95%) but outperforming LLaMA (83%). It matched ChatGPT-o1 on diagnostic case reasoning and RECIST tumor classification, and clinicians rated its diagnostic reasoning steps as more accurate. However, it lagged behind ChatGPT-o1 in generating high-quality imaging report summaries. These findings affirm DeepSeek-R1's potential in clinical reasoning and structured decision tasks, while highlighting the ongoing need to improve its generative capabilities in nuanced reporting contexts.