Publications

Medcalc-Bench

Published in NeurIPS 2024 Datasets and Benchmark Track Oral, 2024

We propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained.

Recommended citation: Khandekar, N., Jin, Q., Xiong, G., Dunn, S., Applebaum, S. S., Anwar, Z., Sarfo-Gyamfi, M., Safranek, C. W., Anwar, A. A., Zhang, A., Gilson, A., Singer, M. B., Dave, A., Taylor, A., Zhang, A., Chen, Q., & Lu, Z. (2024). MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. arXiv. https://arxiv.org/abs/2406.12036.
Download Paper

Soren Dunn

Publications

Medcalc-Bench

Agentless: Demystifying LLM-based Software Engineering Agents