Study Evaluates LLM Accuracy on Multilingual Vaccine Questions

AI Models Show Strong Vaccine Knowledge, but Gaps Remain Across Languages and Topics

A new study published in npj Vaccines has evaluated the accuracy of large language models (LLMs) in answering vaccine-related questions across multiple languages, revealing both impressive capabilities and critical limitations.

The Study at a Glance

Researchers developed VaxEval, a multilingual benchmark consisting of 1,886 multiple-choice questions covering 14 vaccines in English, Spanish, and Chinese.

Questions spanned a wide range of topics:

  • Vaccination schedules
  • Efficacy and safety
  • Adverse effects
  • Myths and misconceptions
  • Access and cost
  • Disease prevention

Sources included: WHO, CDC, UNICEF, Africa CDC, AMA, Immunize.org, and peer-reviewed literature.

Models and Methods Tested

A total of 13 LLMs were assessed:

  • GPT-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo
  • Claude 3 Opus, Gemini 1.5 Pro
  • Llama-4 Maverick, DeepSeek-V3, Grok-3
  • Qwen 2.5, GLM-4, Reka Core, Yi-Lightning

Prompting methods included zero-shot, few-shot, and chain-of-thought approaches.
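
The article does not reproduce the study's prompts, so the following is only a minimal sketch of how the three prompting styles typically differ for a multiple-choice item. The MCQ class, the instruction wording, and the placeholder question are all hypothetical and are not taken from VaxEval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MCQ:
    """A multiple-choice item (hypothetical structure, not the VaxEval schema)."""
    question: str
    options: dict          # maps option letter -> option text
    answer: Optional[str] = None  # gold letter, needed only for few-shot examples


def format_question(q: MCQ) -> str:
    """Render the question stem followed by one lettered option per line."""
    lines = [q.question]
    lines += [f"{letter}. {text}" for letter, text in q.options.items()]
    return "\n".join(lines)


def zero_shot(q: MCQ) -> str:
    """Zero-shot: the question alone, with a request for a single letter."""
    return format_question(q) + "\nAnswer with a single letter.\nAnswer:"


def few_shot(q: MCQ, examples: list) -> str:
    """Few-shot: solved examples first, then the target question."""
    solved = [format_question(e) + f"\nAnswer: {e.answer}" for e in examples]
    return "\n\n".join(solved + [zero_shot(q)])


def chain_of_thought(q: MCQ) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the final letter."""
    return (
        format_question(q)
        + "\nThink step by step, then give the final answer as a single letter."
    )


# Placeholder items only; no real vaccine content is implied.
target = MCQ("Placeholder question text?", {"A": "Option one", "B": "Option two"})
example = MCQ("Placeholder example?", {"A": "Option one", "B": "Option two"}, answer="B")
print(few_shot(target, [example]))
```

The only difference between the three builders is what surrounds the same formatted question, which is what makes the strategies comparable on a fixed benchmark.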

Key Results

Average accuracy: 86.0% (English), 83.7% (Spanish), 80.0% (Chinese).

GPT-4o achieved the highest overall accuracy at 90.3%, closely followed by Llama-4 Maverick (90.2%) and DeepSeek-V3 (89.6%).

Flagship models showed 57% higher odds of correct answers compared to earlier systems.

Prompting Strategy Matters

  • Few-shot prompting increased the likelihood of correct responses by 17% compared to zero-shot.
  • Chain-of-thought prompting was surprisingly associated with 21% lower odds of correctness.
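
Odds-based figures like those above are easy to misread as percentage-point changes in accuracy. As a rough illustration, the sketch below converts an odds ratio into an accuracy, assuming a hypothetical baseline of 80% (not a figure from the study) and reading "57% higher odds" and "21% lower odds" as odds ratios of 1.57 and 0.79.

```python
def apply_odds_ratio(baseline_accuracy: float, odds_ratio: float) -> float:
    """Return the accuracy implied by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_accuracy < 1.0:
        raise ValueError("baseline_accuracy must be strictly between 0 and 1")
    odds = baseline_accuracy / (1.0 - baseline_accuracy)  # p / (1 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)                    # back to a probability


# Assumed baseline for illustration only; the study's reference level may differ.
baseline = 0.80
for label, ratio in [("57% higher odds", 1.57), ("21% lower odds", 0.79)]:
    accuracy = apply_odds_ratio(baseline, ratio)
    print(f"{label}: {baseline:.1%} -> {accuracy:.1%}")
```

Under that assumed baseline, the two ratios correspond to roughly 86% and 76% accuracy, a shift of a few percentage points each rather than 57 or 21.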

Accuracy Varies by Vaccine and Topic

Best-performing vaccines:

  Vaccine        Accuracy
  Influenza      90.5%
  Hepatitis A    89.5%
  HPV            88.4%
  COVID-19       85.3%

Most challenging vaccines:

  Vaccine          Accuracy
  Dengue           76.4%
  Pneumococcal     77.7%
  RSV              80.6%
  Meningococcal    81.7%

By topic:

  • Highest accuracy: Misconceptions (93.0%), Prevention (90.0%), Regulatory (87.2%)
  • Lowest accuracy: Vaccine types/basic info (82.5%), Dosing/recommendations (82.5%), Cost/accessibility (82.6%)
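
Breakdowns like these amount to grouping graded answers by an attribute and taking the share that are correct. A minimal sketch, assuming each graded response is a dictionary with hypothetical field names (language, predicted, answer); the toy records are placeholders, not study data.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: fraction correct} for the attribute named by key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # A response counts as correct when the predicted letter matches the gold letter.
        correct[group] += int(record["predicted"] == record["answer"])
    return {group: correct[group] / total[group] for group in total}


# Toy records for illustration only.
records = [
    {"language": "en", "predicted": "A", "answer": "A"},
    {"language": "es", "predicted": "B", "answer": "B"},
    {"language": "es", "predicted": "C", "answer": "D"},
]
print(accuracy_by(records, "language"))
```

The same function would serve for a vaccine or topic breakdown by passing a different key, provided the records carry that field.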

Error Analysis Reveals Recurrent Weaknesses

An analysis of 150 incorrect responses found that nearly half involved overgeneralization.

Other common error types included:

  • Incorrect dosing intervals
  • Misidentification of contraindications
  • Incorrect age-based eligibility recommendations
  • Inability to distinguish between vaccine types

Conclusions and Cautions

"Multiple-choice accuracy does not establish clinical reliability."

The authors emphasize that while modern LLMs demonstrate substantial vaccine-related knowledge, they exhibit weaknesses in areas requiring explicit clinical guidance. Inconsistent accuracy across vaccines and languages persists.

Critical takeaway: Before deployment in health settings, the study stresses the need for oversight, continuous evaluation, and robust safeguards. Further studies are needed to assess the real-world effectiveness of AI-supported health communication.