AI Models Show Strong Vaccine Knowledge, but Gaps Remain Across Languages and Topics
A new study published in npj Vaccines has evaluated the accuracy of large language models (LLMs) in answering vaccine-related questions across multiple languages, revealing both impressive capabilities and critical limitations.
The Study at a Glance
Researchers developed VaxEval, a multilingual benchmark consisting of 1,886 multiple-choice questions covering 14 vaccines in English, Spanish, and Chinese.
Questions spanned a wide range of topics:
- Vaccination schedules
- Efficacy and safety
- Adverse effects
- Myths and misconceptions
- Access and cost
- Disease prevention
Sources included: WHO, CDC, UNICEF, Africa CDC, AMA, Immunize.org, and peer-reviewed literature.
Models and Methods Tested
A total of 13 LLMs were assessed:
- GPT-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo
- Claude 3 Opus, Gemini 1.5 Pro
- Llama-4 Maverick, DeepSeek-V3, Grok-3
- Qwen 2.5, GLM-4, Reka Core, Yi-Lightning
Prompting methods included zero-shot, few-shot, and chain-of-thought approaches.
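The article does not reproduce the study's actual prompts, so the following is only a minimal sketch of how the three prompting methods typically differ for a multiple-choice item. The `Item` class, the `build_prompt` function, the instruction wording, and the placeholder question text are all hypothetical and not taken from VaxEval.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A hypothetical multiple-choice item; field names are illustrative."""
    question: str
    options: dict[str, str]  # option letter -> option text
    answer: str = ""         # correct letter, used only for few-shot examples


def format_item(item: Item, include_answer: bool = False) -> str:
    """Render one item as text, optionally revealing its answer."""
    lines = [f"Question: {item.question}"]
    lines += [f"{letter}. {text}" for letter, text in item.options.items()]
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: Item, method: str, examples: tuple[Item, ...] = ()) -> str:
    """Assemble a prompt for the target item under an assumed template."""
    if method == "zero-shot":
        # The model sees only the question it must answer.
        return format_item(target)
    if method == "few-shot":
        # Solved examples come first, then the unanswered target question.
        solved = [format_item(ex, include_answer=True) for ex in examples]
        return "\n\n".join(solved + [format_item(target)])
    if method == "chain-of-thought":
        # The model is asked to reason before committing to an option.
        instruction = "Reason step by step, then give the final option letter."
        return f"{instruction}\n\n{format_item(target)}"
    raise ValueError(f"unknown prompting method: {method}")


# Placeholder content only; no real benchmark question is shown here.
target = Item("<question text>", {"A": "<option A>", "B": "<option B>"})
example = Item("<example question>", {"A": "<option A>", "B": "<option B>"}, answer="A")
print(build_prompt(target, "few-shot", (example,)))
```

The only difference between the three variants is what surrounds the target question: nothing, solved examples, or an instruction to reason first.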
Key Results
Average accuracy: 86.0% (English), 83.7% (Spanish), 80.0% (Chinese).
GPT-4o achieved the highest overall accuracy at 90.3%, closely followed by Llama-4 Maverick (90.2%) and DeepSeek-V3 (89.6%).
Flagship models showed 57% higher odds of correct answers compared to earlier systems.
Prompting Strategy Matters
- Few-shot prompting increased the likelihood of correct responses by 17% compared to zero-shot.
- Chain-of-thought prompting was surprisingly associated with 21% lower odds of correctness.
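Changes in odds are not the same as changes in accuracy, so figures such as "57% higher odds" or "21% lower odds" can be easy to misread. The sketch below converts an odds ratio into an implied accuracy for a given baseline. The 80% baseline is an assumption chosen purely for illustration, and the function name `apply_odds_ratio` is hypothetical; the article does not report the reference accuracies behind these odds figures.

```python
def apply_odds_ratio(baseline_accuracy: float, odds_ratio: float) -> float:
    """Return the accuracy implied by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_accuracy < 1.0:
        raise ValueError("baseline_accuracy must be strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    # Odds are the ratio of correct to incorrect answers: p / (1 - p).
    baseline_odds = baseline_accuracy / (1.0 - baseline_accuracy)
    new_odds = baseline_odds * odds_ratio
    # Convert the scaled odds back into a probability.
    return new_odds / (1.0 + new_odds)


ASSUMED_BASELINE = 0.80  # hypothetical reference accuracy, not from the study

for label, ratio in [("57% higher odds", 1.57), ("21% lower odds", 0.79)]:
    accuracy = apply_odds_ratio(ASSUMED_BASELINE, ratio)
    print(f"{label}: {ASSUMED_BASELINE:.1%} -> {accuracy:.1%}")
```

Under that assumed baseline, 57% higher odds corresponds to roughly 86% accuracy and 21% lower odds to roughly 76%, a shift of a few percentage points rather than a 57% or 21% change in accuracy itself.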
Accuracy Varies by Vaccine and Topic
Best-performing vaccines:
- Influenza: 90.5%
- Hepatitis A: 89.5%
- HPV: 88.4%
- COVID-19: 85.3%
Most challenging vaccines:
- Dengue: 76.4%
- Pneumococcal: 77.7%
- RSV: 80.6%
- Meningococcal: 81.7%
By topic:
- Highest accuracy: Misconceptions (93.0%), Prevention (90.0%), Regulatory (87.2%)
- Lowest accuracy: Vaccine types/basic info (82.5%), Dosing/recommendations (82.5%), Cost/accessibility (82.6%)
Error Analysis Reveals Recurrent Weaknesses
An analysis of 150 incorrect responses found that nearly half of the errors involved overgeneralization.
Other common error types included:
- Incorrect dosing intervals
- Misidentification of contraindications
- Incorrect age-based eligibility recommendations
- Inability to distinguish between vaccine types
Conclusions and Cautions
"Multiple-choice accuracy does not establish clinical reliability."
The authors emphasize that while modern LLMs demonstrate substantial vaccine-related knowledge, they exhibit weaknesses in areas requiring explicit clinical guidance. Inconsistent accuracy across vaccines and languages persists.
Critical takeaway: Before deployment in health settings, the study stresses the need for oversight, continuous evaluation, and robust safeguards. Further studies are needed to assess the real-world effectiveness of AI-supported health communication.