
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering

Comparative Case Study

dc.contributor.author: Roos, Jonas
dc.contributor.author: Martin, Ron
dc.contributor.author: Kaczmarczyk, Robert
dc.date.accessioned: 2025-08-01T13:18:24Z
dc.date.available: 2025-08-01T13:18:24Z
dc.date.issued: 17.12.2024
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13300
dc.description.abstract: Background: The rapid development of large language models (LLMs) such as OpenAI's ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities.
Objective: This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations.
Methods: This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, divided into 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data were obtained from AMBOSS, including metrics such as the "student passed mean" and "majority vote". Statistical analysis was conducted using Python (Python Software Foundation), with key libraries for data manipulation and visualization.
Results: GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared to Bard's 44.6% (477/1070), a statistically significant difference (χ²₁=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly higher than Bard's 4.1% (44/1070; χ²₁=83.1, P<.001). When considering only answered questions, GPT-4 1106's accuracy increased to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; χ²₁=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; χ²₁=4.8, P=.03). Language-specific analysis revealed both models performed better in German than English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.6% vs 327/605, 54.1%; χ²₁=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; χ²₁=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: χ²₁=408.5, P<.001; Bard Gemini Pro: χ²₁=626.6, P<.001).
Conclusions: Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and could serve as a support tool for students. However, their performance varies depending on the language used, with a preference for German, and both models show limitations in responding to non-English content. The accuracy rates, particularly when compared with student responses, highlight the potential of these models in medical education, yet the need for further optimization and a better understanding of their limitations in diverse linguistic contexts remains critical.
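The record does not include the analysis scripts; the following is a minimal illustrative sketch, assuming SciPy is used (the abstract only states that the analysis was done in Python), of how one of the chi-square comparisons reported above (overall accuracy of GPT-4 1106 Vision Preview, 609/1070, versus Bard Gemini Pro, 477/1070) can be reproduced from the published counts.

    # Minimal sketch (assumption: SciPy; the record does not name the packages used).
    # Compares the two models' overall accuracy with a chi-square test on a 2x2 table.
    from scipy.stats import chi2_contingency

    table = [
        [609, 1070 - 609],  # GPT-4 1106 Vision Preview: correct, incorrect
        [477, 1070 - 477],  # Bard Gemini Pro: correct, incorrect
    ]

    chi2, p, dof, _ = chi2_contingency(table)  # Yates continuity correction applied by default for 2x2 tables
    print(f"chi2({dof}) = {chi2:.1f}, P = {p:.2g}")  # approx. chi2(1) = 32.1, P < .001, as reported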
dc.format.extent: 10
dc.language.iso: eng
dc.rights: Namensnennung 4.0 International (Attribution 4.0 International)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: medical education
dc.subject: image analysis
dc.subject: large language model
dc.subject: LLM
dc.subject: student
dc.subject: performance
dc.subject: comparative
dc.subject: case study
dc.subject: artificial intelligence
dc.subject: AI
dc.subject: ChatGPT
dc.subject: effectiveness
dc.subject: diagnostic
dc.subject: training
dc.subject: accuracy
dc.subject: utility
dc.subject: image-based
dc.subject: question
dc.subject: image
dc.subject: AMBOSS
dc.subject: English
dc.subject: German
dc.subject: question and answer
dc.subject: Python
dc.subject: AI in health care
dc.subject: health care
dc.subject.ddc: 610 Medizin, Gesundheit (Medicine and health)
dc.title: Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering
dc.title.alternative: Comparative Case Study
dc.type: Wissenschaftlicher Artikel (scientific article)
dc.publisher.name: JMIR Publications
dc.publisher.location: Toronto
dc.rights.accessRights: openAccess
dcterms.bibliographicCitation.volume: 2024, vol. 8
dcterms.bibliographicCitation.issue: e57592
dcterms.bibliographicCitation.pagestart: 1
dcterms.bibliographicCitation.pageend: 10
dc.relation.doi: https://doi.org/10.2196/57592
dcterms.bibliographicCitation.journaltitle: JMIR Formative Research
ulbbn.pubtype: Zweitveröffentlichung (secondary publication)
dc.version: publishedVersion
ulbbn.sponsorship.oaUnifund: OA-Förderung Universität Bonn (OA funding, University of Bonn)

