Introduction to OpenScholar
OpenScholar is a new retrieval-augmented language model (LM) designed to deliver reliable and high-quality responses to information-seeking queries about scientific literature. Its primary function is to identify relevant papers, synthesize their findings, and generate a response with accompanying in-line citations to specific passages from scientific literature.
Task Formulation and Challenges
The task involves creating a response grounded in retrieved evidence while providing verifiable citations. Key challenges include: retrieving high-recall, high-precision scientific content from vast, specialized corpora; synthesizing accurate, non-hallucinated responses; producing citation-aware outputs; and addressing the scarcity of large-scale, up-to-date scientific corpora and supervised training data.
Overview of OpenScholar's Innovations
OpenScholar extends the standard Retrieval-Augmented Generation (RAG) model for scientific literature synthesis by incorporating domain-specialized retrieval, citation-aware generation, and a self-feedback inference mechanism. It is built upon an open and large-scale scientific data store. The system formally consists of three main components: a data store (D), a retriever (R), and a generator LM (G). The process begins with the retriever identifying relevant passages from D, which then serve as context for the generator LM to produce the output response (y) along with citations (C).
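To make the flow concrete, a minimal retrieve-then-generate sketch is shown below; the object names and method signatures are illustrative placeholders, not OpenScholar's actual API.

```python
# Minimal retrieve-then-generate sketch (names and interfaces are illustrative only).
def answer_query(query, datastore, retriever, reranker, generator, n_passages=10):
    # R: retrieve candidate passages from the data store D for the query
    candidates = retriever.search(datastore, query, top_k=100)
    # Keep only the top-N most relevant passages as context
    passages = reranker.rank(query, candidates)[:n_passages]
    # G: produce the response y together with citations C to the retrieved passages
    response, citations = generator.generate(query, passages)
    return response, citations
```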
Key technical contributions include:
- OpenScholar Scientific Data Store (OSDS): A data store of 45 million scientific papers, split into 236 million passages with precomputed dense embeddings, representing the largest and most current open-sourced scientific literature data store available.
- Optimized Retrieval Pipeline: Integrates a trained OpenScholar retriever and reranker, optimized on scientific data, to select the most relevant passages for the generator, ensuring broad coverage and improved relevance.
- Iterative Self-Feedback Inference: The LM generates an initial draft, then refines it iteratively using retrieval-augmented self-feedback and citation verification to improve factuality and evidence grounding.
- High-Quality Training Data Generation: The inference pipeline generates specialized training data, enabling the development of LMs that produce more accurate and citation-aware long-form answers.
OpenScholar Retrieval Pipeline
The retrieval pipeline comprises the OSDS data store, a bi-encoder retriever (θbi), and a cross-encoder reranker (θcross). Initial candidate paragraphs are selected using OSDS and θbi, alongside external APIs, and then refined by θcross to identify the top N relevant paragraphs.
Scientific Paper Collection and Data Store Construction
OSDS is built from peS2o v3, which includes 45 million papers up to October 2024, yielding a data store of 236 million passages. Papers are split into discrete 256-word text blocks, with the paper title concatenated to each block.
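A rough sketch of this chunking step, assuming a simple whitespace word split (the actual peS2o processing may differ):

```python
def chunk_paper(title: str, full_text: str, block_words: int = 256) -> list[str]:
    """Split a paper into roughly 256-word blocks, each prefixed with the paper title."""
    words = full_text.split()
    blocks = []
    for start in range(0, len(words), block_words):
        body = " ".join(words[start:start + block_words])
        blocks.append(f"{title}\n{body}")  # title concatenated to every block
    return blocks
```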
Initial Paragraph Retrieval
Passages are retrieved from three sources:
- OSDS: Using the trained passage bi-encoder θbi, which encodes text into dense vectors. θbi is continually pre-trained on peS2o data in an unsupervised fashion to improve domain-specific retrieval. The top 70 passages are retrieved via nearest-neighbor search (see the dense-retrieval sketch after this list).
- Semantic Scholar API: Keywords generated from the query are used to retrieve top 10 papers ranked by citation count. Full text is extracted if available; otherwise, only the abstract is used.
- Web Search Engine: Top 10 results are obtained via You.com API, restricted to academic platforms like arXiv and PubMed. Full texts are added if open-access, or only abstracts are included.
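The dense-retrieval leg over OSDS can be sketched as a nearest-neighbor search over the precomputed passage embeddings; the plain NumPy inner-product search below is a simplification, not OpenScholar's actual index.

```python
import numpy as np

def dense_retrieve(query_vec: np.ndarray, passage_vecs: np.ndarray, top_k: int = 70):
    """Return indices of the top-k passages by inner-product similarity.

    query_vec:    (d,) query embedding from the bi-encoder θbi
    passage_vecs: (num_passages, d) precomputed passage embeddings from OSDS
    """
    scores = passage_vecs @ query_vec               # inner-product similarity
    top = np.argpartition(-scores, top_k)[:top_k]   # unordered top-k candidates
    return top[np.argsort(-scores[top])]            # sorted by descending score
```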
Top N Paragraph Reranking and Finalization
A cross-encoder reranker (θcross), fine-tuned on synthetic relevance data generated with Llama-3-70B-Instruct, computes a relevance score between the query and each candidate passage. Passages are ranked by this score, and metadata-based filtering limits the output to three passages per paper while incorporating normalized citation counts into the relevance scores.
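Under simplified assumptions, the reranking and meta-filtering step might look like the sketch below; the candidate field names, the `cross_encoder.score` interface, and the citation-count weighting are illustrative, since the exact scoring formula is not specified here.

```python
def rerank(query, candidates, cross_encoder, top_n=10, max_per_paper=3, cite_weight=0.1):
    """Score candidates with the cross-encoder, fold in normalized citation counts,
    and keep at most a few passages per paper (illustrative weighting and caps)."""
    max_cites = max(c["citations"] for c in candidates) or 1
    scored = []
    for c in candidates:
        score = cross_encoder.score(query, c["text"])       # query-passage relevance
        score += cite_weight * c["citations"] / max_cites   # normalized citation count (assumed weighting)
        scored.append((score, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    kept, per_paper = [], {}
    for score, c in scored:
        if per_paper.get(c["paper_id"], 0) < max_per_paper:  # at most three passages per paper
            kept.append(c)
            per_paper[c["paper_id"]] = per_paper.get(c["paper_id"], 0) + 1
        if len(kept) == top_n:
            break
    return kept
```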
Inference: Self-Reflective Iterative RAG
This approach addresses unsupported claims and incomplete outputs in standard RAG by introducing an iterative generation process with self-feedback.
Initial Response and Feedback Generation
The generator LM produces an initial response (y0) with citation markers and a set of natural language feedback (F) aimed at improving y0. Feedback can include retrieval queries for missing content.
Iterative Refinement
The LM iterates through the feedback, refining the output (yk) using previous outputs and potentially new passages retrieved based on feedback queries. This continues until all feedback is addressed, resulting in a final output (yT).
Citation Verification
The generator LM verifies that all citation-worthy statements are supported by references from retrieved passages and performs post-hoc insertions for any claims lacking proper citations.
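Taken together, the three steps can be sketched as the loop below; the `lm` and `retrieve` interfaces and the feedback representation are hypothetical stand-ins for the generator LM and the retrieval pipeline.

```python
def self_reflective_generate(query, passages, lm, retrieve):
    # Step 1: initial draft y0 with citation markers, plus natural-language feedback F
    draft = lm.generate(query, passages)
    feedback_list = lm.generate_feedback(query, draft)

    # Step 2: address each feedback item, retrieving new passages when feedback asks for missing content
    for feedback in feedback_list:
        if feedback.get("retrieval_query"):
            passages = passages + retrieve(feedback["retrieval_query"])
        draft = lm.refine(query, draft, feedback, passages)  # y_k conditioned on the previous output

    # Step 3: verify citations and insert post-hoc citations for unsupported claims
    return lm.verify_citations(query, draft, passages)
```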
Synthetic Training Data Generation with Inference Pipeline
To overcome the scarcity of training data for scientific LMs, OpenScholar's inference pipeline is used to synthetically generate high-quality training data through self-feedback. The process involves selecting top-cited papers, generating information-seeking queries based on their abstracts, and using the OpenScholar pipeline to produce final outputs and intermediate generations (feedback, initial outputs).
Data filtering is applied through pairwise comparison of final (yT) and initial (y0) outputs, and rubric filtering based on organization, factual precision, and citation accuracy. This synthetic data is then blended with existing general-domain and scientific instruction-tuning data to train generator LMs, such as Llama-3.1-8B-Instruct.
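A sketch of the filtering stage under stated assumptions: `prefers_final` and `rubric_scores` are hypothetical wrappers around an LM judge, and the score threshold is illustrative.

```python
def filter_synthetic_examples(examples, prefers_final, rubric_scores, min_score=4):
    """Keep synthetic (query, answer) pairs whose refined output beats the initial draft
    and clears rubric checks on organization, factual precision, and citation accuracy."""
    kept = []
    for ex in examples:
        # Pairwise filtering: the final output y_T must be preferred over the initial y_0
        if not prefers_final(ex["query"], ex["initial_output"], ex["final_output"]):
            continue
        # Rubric filtering: every criterion must clear the (illustrative) threshold
        scores = rubric_scores(ex["query"], ex["final_output"])
        if all(scores[c] >= min_score for c in ("organization", "factual_precision", "citation_accuracy")):
            kept.append(ex)
    return kept
```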
ScholarQABench: A Comprehensive Scientific Literature Synthesis Benchmark
ScholarQABench was developed to address limitations in previous benchmarks, which often relied on small-scale human evaluations or oversimplified multiple-choice QA. Creating such datasets is challenging, requiring PhD-level domain expertise. The benchmark supports diverse formats, including closed-form classification, multiple-choice, and long-form generation.
Data Curation Principles
ScholarQABench's curation is guided by:
- Diversity of Tasks: Includes tasks with various input-output formats.
- Diversity of Disciplines: Spans computer science, biomedicine, physics, and neuroscience.
- Inclusion of Multi-paper Tasks: Unlike prior work, all tasks require retrieval from an entire open-access collection of full texts, and four datasets specifically require reasoning over multiple papers. ScholarQABench is the first multidisciplinary benchmark for expert-annotated long-form generation grounded in multiple recent papers.
ScholarQABench Tasks
Single-paper Tasks (Open Retrieval):
- SciFact: 1,400 expert-written biomedical claims, with the task reformulated as binary open retrieval (supports/contradicts).
- PubMedQA: Expert-annotated (yes/no/maybe) QA data, reformulated for open-retrieval (yes/no).
- QASA: QA dataset requiring reasoning over AI/ML scientific articles, evaluated in an end-to-end QA setup.
Multi-paper Tasks:
- Scholar-CS: 100 questions with detailed answer rubrics for various computer science disciplines. Includes 31 expert-written long-form answers.
- Scholar-Bio and Scholar-Neuro: 2,759 expert-written literature review questions in biomedicine and neuroscience.
- Scholar-Multi: 108 literature review questions and expert-written answers with citations across computer science, biomedicine, and physics domains. Annotations are by PhD students or postdoctoral scientists.
Metrics and Evaluation Protocols
A multifaceted automatic evaluation pipeline complements expert assessments.
- Correctness:
  - For single-paper tasks, accuracy is measured by exact match (SciFact, PubMedQA) or ROUGE-L (QASA).
  - For multi-paper tasks (Scholar-CS), a rubric score is used, combining annotation-driven criteria (60%) and general criteria (40%), scored by GPT-4o.
- Citation Accuracy: Measured by citation F1, combining recall (whether citation-worthy statements are supported by appropriate citations) and precision (whether cited passages are relevant and necessary). Applicable across all tasks; a sketch of the computation follows this list.
- Content Quality and Organization (Scholar-Multi): Evaluates relevance, topic coverage (breadth and depth), organization, and writing flow using Prometheus v2 and human experts with five-point rubrics.
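For reference, citation F1 can be sketched as the harmonic mean of citation recall and precision computed from per-statement and per-citation judgments; the judgment functions are assumed to come from an external evaluator and are not shown.

```python
def citation_f1(statements, citations, statement_supported, citation_relevant):
    """Citation recall: fraction of citation-worthy statements supported by their citations.
    Citation precision: fraction of citations that are relevant and necessary."""
    recall = sum(statement_supported(s) for s in statements) / max(len(statements), 1)
    precision = sum(citation_relevant(c) for c in citations) / max(len(citations), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```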