GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark for General Medical Artificial Intelligence

GMAI-MMBench is a comprehensive multimodal evaluation benchmark designed to test the capabilities of Large Vision-Language Models (LVLMs) in real-world clinical scenarios. It comprises 285 datasets covering 39 medical imaging modalities, 18 clinically relevant tasks, 18 departments, and 4 perceptual granularities, all presented in a Visual Question Answering (VQA) format. It also implements a vocabulary tree structure that lets users customize evaluation tasks to fit different assessment needs, supporting medical artificial intelligence research and applications.

GMAI-MMBench is a multimodal evaluation benchmark that tests the capabilities of Large Vision-Language Models (LVLMs) in real-world clinical scenarios.

The benchmark draws on 285 datasets. It spans 39 medical imaging modalities, including X-ray, CT, and MRI; 18 clinically relevant tasks, such as disease diagnosis, lesion identification, and treatment plan recommendation; and 18 clinical departments, including internal medicine, surgery, obstetrics and gynecology, and pediatrics. It also defines 4 perceptual granularities, so that model performance is assessed at more than one level of detail.

GMAI-MMBench is constructed in a Visual Question Answering (VQA) format: the model is shown a medical image, asked a question about it, and must give the correct answer. This keeps the evaluation direct and tests how well the model understands medical images and the related clinical information.
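To make the VQA setup concrete, here is a minimal sketch of how one item and an accuracy score might be represented. It assumes a multiple-choice question style, which the description above does not specify, and every name in it (`VQAItem`, its fields, `accuracy`, the file names) is hypothetical rather than the benchmark's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class VQAItem:
    """One hypothetical benchmark item; field names are assumptions."""
    image_path: str           # path to the medical image
    question: str             # question posed about the image
    options: Dict[str, str]   # option label -> option text
    answer: str               # label of the correct option


def accuracy(items: List[VQAItem], predict: Callable[[VQAItem], str]) -> float:
    """Fraction of items for which `predict` returns the correct option label."""
    if not items:
        return 0.0
    correct = sum(
        1 for item in items
        if predict(item).strip().upper() == item.answer.upper()
    )
    return correct / len(items)


# Two made-up items, for illustration only.
items = [
    VQAItem("img_001.png", "Which imaging modality is shown?",
            {"A": "X-ray", "B": "MRI"}, "A"),
    VQAItem("img_002.png", "Which imaging modality is shown?",
            {"A": "CT", "B": "MRI"}, "B"),
]


def always_a(item: VQAItem) -> str:
    """Stand-in for a model: always answers with option A."""
    return "A"


print(accuracy(items, always_a))  # prints 0.5
```

In practice `predict` would wrap a call to the model under evaluation; the stand-in above only shows the shape of the interface.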

GMAI-MMBench also implements a vocabulary tree structure that lets users customize evaluation tasks to their needs. Users can select a particular medical imaging modality, clinical task, or perceptual granularity and run the assessment on just that selection. This flexibility supports medical artificial intelligence research and applications by helping researchers and practitioners understand and improve the performance of Large Vision-Language Models in clinical scenarios.
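One way to picture the customization described above is as a filter over a label tree: choosing a node selects every item tagged with a leaf beneath it. The sketch below is an illustration under stated assumptions, not the benchmark's implementation. The nested-dict layout, the `radiology` grouping node, the item dictionaries, and the function names are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical label tree: a node maps to its children, a leaf maps to {}.
TREE: Dict[str, dict] = {
    "modality": {
        "radiology": {"X-ray": {}, "CT": {}, "MRI": {}},  # assumed grouping
    },
    "task": {"disease diagnosis": {}, "lesion identification": {}},
}


def find_node(tree: Dict[str, dict], name: str) -> Optional[dict]:
    """Depth-first search for the subtree rooted at `name`; None if absent."""
    for key, child in tree.items():
        if key == name:
            return child
        found = find_node(child, name)
        if found is not None:
            return found
    return None


def leaf_labels(name: str, subtree: dict) -> Set[str]:
    """Collect the leaf labels under a node; a leaf returns its own name."""
    if not subtree:
        return {name}
    labels: Set[str] = set()
    for key, child in subtree.items():
        labels |= leaf_labels(key, child)
    return labels


def select(items: List[dict], tree: Dict[str, dict], node_name: str) -> List[dict]:
    """Keep the items tagged with at least one leaf label under `node_name`."""
    subtree = find_node(tree, node_name)
    if subtree is None:
        raise KeyError(node_name)
    wanted = leaf_labels(node_name, subtree)
    return [item for item in items if wanted & set(item["labels"])]


# Made-up items, each tagged with leaf labels from the tree.
items = [
    {"id": 1, "labels": ["X-ray", "disease diagnosis"]},
    {"id": 2, "labels": ["MRI", "lesion identification"]},
    {"id": 3, "labels": ["CT", "disease diagnosis"]},
]

print([item["id"] for item in select(items, TREE, "disease diagnosis")])  # prints [1, 3]
```

Selecting an inner node such as `radiology` returns all three items, while selecting a leaf such as `MRI` returns only item 2, so the same call serves both broad and narrow evaluation subsets.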