DeepSeek-R1 Shows Promise—and Limits—in Medical AI Benchmarking
| Type | research |
|---|---|
| Area | AIMedical |
| Published(YearMonth) | 2504 |
| Source | https://www.nature.com/articles/s41591-025-03726-3 |
| Tag | newsletter |
| Checkbox | |
| Date(of entry) | |
In a Nature Medicine brief communication, researchers benchmarked the DeepSeek-R1 large language model on a range of clinical tasks against ChatGPT-o1 and LLaMA 3.1-405B. DeepSeek-R1 scored 92% on the USMLE, slightly behind ChatGPT-o1 (95%) but well ahead of LLaMA (83%). It matched ChatGPT-o1 on diagnostic case reasoning and RECIST tumor response classification, and clinicians rated its intermediate diagnostic reasoning steps as more accurate. It lagged behind ChatGPT-o1, however, in producing high-quality imaging report summaries. These findings support DeepSeek-R1’s potential in clinical reasoning and structured decision tasks, while underscoring the ongoing need to improve its generative capabilities in nuanced reporting contexts.
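To make the accuracy comparison concrete, below is a minimal sketch of how a USMLE-style multiple-choice benchmark harness could be structured. This is not the paper's actual evaluation pipeline: the `MCQuestion` type, the `ask_model()` stub, and the sample questions are all hypothetical placeholders standing in for a real LLM API call and a real question set.

```python
"""Minimal sketch of a multiple-choice benchmark harness, assuming a
USMLE-style question set and an LLM accessible through a single call.
All names and data here are illustrative, not from the paper."""

from dataclasses import dataclass


@dataclass
class MCQuestion:
    stem: str                # question text
    options: dict[str, str]  # option key -> option text, e.g. {"A": "..."}
    answer: str              # gold option key


def ask_model(question: MCQuestion) -> str:
    """Hypothetical stand-in for an LLM API call.

    A real harness would prompt the model with the stem and options and
    parse the chosen option letter from its response; here we return a
    fixed guess so the script runs end to end.
    """
    return "A"


def accuracy(questions: list[MCQuestion]) -> float:
    """Fraction of questions where the model's choice matches the key."""
    correct = sum(ask_model(q) == q.answer for q in questions)
    return correct / len(questions)


if __name__ == "__main__":
    sample = [
        MCQuestion(
            stem="A 45-year-old presents with crushing chest pain...",
            options={"A": "Myocardial infarction", "B": "GERD"},
            answer="A",
        ),
        MCQuestion(
            stem="Which electrolyte abnormality prolongs the QT interval?",
            options={"A": "Hyperkalemia", "B": "Hypocalcemia"},
            answer="B",
        ),
    ]
    # With a real model and full question set, this is where a headline
    # figure like "92% accuracy" would come from.
    print(f"Accuracy: {accuracy(sample):.0%}")
```

A real evaluation would add answer-parsing robustness (models often reply in free text rather than a bare option letter) and report per-task breakdowns, which is how the reasoning-step ratings and RECIST results in the study differ from a single aggregate score.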