Medical Reasoning with Large Language Models: A Survey and MR-Bench

    March 2026 · ArXiv.org
    Xiaohan Ren, Chenxiao Fan, Wenyin Ma, Hongliang He, Chongming Gao, Xiaoyan Zhao, Fuli Feng
    This survey reviews the use of large language models (LLMs) in medical reasoning, emphasizing the need for robust reasoning beyond factual recall in clinical settings. It conceptualizes medical reasoning as involving abduction, deduction, and induction, and categorizes existing methods into seven technical routes. A cross-benchmark evaluation of medical reasoning models reveals a significant gap between exam-level performance and real-world clinical decision-making accuracy, a discrepancy further highlighted by the introduction of MR-Bench, a benchmark built from real hospital data. The survey offers a comprehensive overview of current methods, benchmarks, and evaluation practices, identifying key gaps in LLM performance for clinical reasoning.