
AI Model OpenScholar Demonstrates High Accuracy in Scientific Literature Reviews


A new artificial intelligence model, OpenScholar, and its spin-off, ScholarQABench, have outperformed human-written scientific literature reviews while largely eliminating citation hallucinations. Detailed in a study published in Nature, the models' summaries were preferred by domain experts over human-written ones, and the models operate at minimal cost, offering researchers a new tool.

New AI Model for Scientific Literature

A recent study published in the journal Nature on February 4 details the development of OpenScholar, a new artificial intelligence (AI) model, and its spin-off, ScholarQABench, both designed for scientific literature reviews. These large language models (LLMs), developed by academic researchers, generate cited summaries of the scientific literature. The researchers have made the method, or "recipe," behind OpenScholar publicly available.

Performance and Advantages

In the study, experts in computer science, physics, neuroscience, and biomedicine evaluated summaries produced by OpenScholar and ScholarQABench alongside those written by PhD students and postdocs. The findings indicated that:

  • Domain experts preferred OpenScholar responses 51 percent of the time over human-written reviews.
  • ScholarQABench responses were preferred 70 percent of the time over human-written reviews.
  • The LLMs produced reviews two to three times longer than the human-written summaries, averaging 1,447 and 706 words versus 424 words, suggesting greater breadth and depth of information.
  • In contrast, summaries generated by ChatGPT were preferred over human responses in approximately 31 percent of cases and reportedly struggled with information coverage.

Addressing Citation Hallucinations

A notable finding of the study was OpenScholar's reported lack of hallucinations, that is, the generation of false information or fabricated citations. No hallucinations were identified in reviews created by the OpenScholar LLMs in computer science or biomedicine.

This contrasts with other LLMs, such as GPT-4 and Llama, which frequently produce false citations: 78 to 90 percent of their responses contained fabricated references. These models were also found to generate plausible-looking reference lists in which 78 to 98 percent of titles were fabricated, particularly in biomedicine, and they exhibited low citation accuracy even when citing real papers.

LLMs often struggle with citation accuracy because they generate text from probable word associations learned from diverse training data, which can include non-scientific sources, leading to incorrect or outdated references. An analysis using the GPTZero tool found that at least 51 papers accepted to the NeurIPS conference in December 2025 contained non-existent or inaccurate citations.
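One simple way to flag fabricated references of this kind is to fuzzy-match each cited title against an index of known real papers. The sketch below illustrates the idea only; the title index, the similarity threshold, and the function names are assumptions for this example, not part of the study's tooling or of GPTZero.

```python
# Illustrative check for fabricated citations: compare titles emitted by a
# model against an index of real papers. Threshold and index are assumptions.
import difflib

# Toy index of known real paper titles (lowercased).
KNOWN_TITLES = {
    "attention is all you need",
    "language models are few-shot learners",
}

def looks_real(title, threshold=0.85):
    """A cited title passes if it closely matches some indexed real title."""
    t = title.lower().strip()
    best = max(
        (difflib.SequenceMatcher(None, t, known).ratio() for known in KNOWN_TITLES),
        default=0.0,
    )
    return best >= threshold

citations = [
    "Attention Is All You Need",         # matches the index
    "Quantum Gradient Llamas at Scale",  # matches nothing: likely fabricated
]
flags = {c: looks_real(c) for c in citations}
```

In practice a real verifier would query a bibliographic database rather than a hard-coded set, but the pass/fail structure is the same.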

Mechanism and Training

OpenScholar combines a language model with a specialized database of 45 million open-access scientific papers. This design aims to create a "self-feedback loop" to enhance factuality, coverage, and citation accuracy by directly linking sourced information back to the literature. The OpenScholar 8B model is trained exclusively on this scientific corpus, distinguishing it from LLMs trained on the entire internet.
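The pipeline described above, retrieve passages from a scientific corpus, draft an answer grounded in them, then run a feedback pass over the draft, can be sketched as follows. Everything here is an illustrative assumption: the toy corpus, the word-overlap retriever, and the function names stand in for OpenScholar's actual retriever, generator, and feedback components.

```python
# Minimal sketch of a retrieval-augmented self-feedback loop, in the spirit
# of OpenScholar's described design. All components are illustrative stubs.

# Toy stand-in for the 45-million-paper open-access corpus.
CORPUS = {
    "paper-1": "Transformers use self-attention to model long-range dependencies.",
    "paper-2": "Retrieval-augmented generation grounds model output in source documents.",
    "paper-3": "Citation hallucination is the generation of references that do not exist.",
}

def retrieve(query, k=2):
    """Rank passages by naive word overlap (stand-in for a dense retriever)."""
    q = set(query.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda kv: len(q & set(kv[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def draft_answer(query, passages):
    """Stub generator: each output sentence is tied to a retrieved passage ID."""
    return [(pid, text) for pid, text in passages]

def self_feedback(answer):
    """Feedback pass: drop any sentence whose citation is not in the corpus."""
    return [(pid, text) for pid, text in answer if pid in CORPUS]

def answer_with_citations(query):
    passages = retrieve(query)
    draft = draft_answer(query, passages)
    verified = self_feedback(draft)
    return [f"{text} [{pid}]" for pid, text in verified]

for line in answer_with_citations("how does retrieval-augmented generation work"):
    print(line)
```

The key design point the article describes is that every emitted sentence carries a link back to a retrievable source, so the feedback pass has something concrete to verify against.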

Accessibility, Cost, and Usage

OpenScholar is an open-source model. Researchers can use it for free via an online demonstration or deploy it on their own machines.

The literature reviews produced by OpenScholar are estimated to cost between 1 and 5 cents each, potentially allowing scholars to run thousands of searches per month.

Co-author Hannaneh Hajishirzi noted that OpenScholar operates at a fraction of the cost of commercial LLMs with deep-research tools. Since its demo launch, the LLM has been used by more than 30,000 people and has handled nearly 90,000 user queries. The authors state that the method described in the paper can also improve the literature-review skills of any LLM. Both ScholarQABench and OpenScholar are being made available to the community for ongoing research and refinement.

Identified Limitations

The authors acknowledge several limitations:

  • OpenScholar may not always retrieve the most representative or relevant papers for a query.
  • Its scope is limited by its specific database.
  • The system cannot fully automate scientific literature synthesis.
  • While commercial AI-based literature-review tools utilizing similar techniques exist, few are open source.