ProteomeXchange Consortium Advances FAIR Proteomics Data Sharing
An international team of authors has detailed recent progress in a database update paper published in Nucleic Acids Research. The report highlights the growth, standardization efforts, and future directions of the ProteomeXchange Consortium, which is dedicated to enabling FAIR (Findable, Accessible, Interoperable, Reusable) proteomics data sharing.
By June 2025, the consortium had amassed a total of 64,330 submitted datasets, with 44,248 (69%) publicly accessible, underscoring its role as a central pillar for open proteomics science.
Infrastructure and Data Standards
The consortium maintains a robust infrastructure for the standardized submission, storage, and dissemination of mass spectrometry-based proteomic data. Its member repositories include PRIDE, PeptideAtlas, MassIVE, jPOST, iProX, and Panorama Public.
Submitted datasets consist of raw mass spectrometry files, processed data, and experimental metadata, all structured according to Proteomics Standards Initiative (PSI) standards. Data transfer is facilitated through multiple protocols, including FTP, Aspera, HTTPS, WebDAV, and PRESTO.
Key standardization improvements have been driven by the Sample and Data Relationship Format (SDRF)-Proteomics, which records how samples relate to data files. The consortium ensures traceability by assigning each dataset a unique ProteomeXchange identifier, with reanalyzed datasets receiving distinct RPXD identifiers.
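As a rough illustration of how the identifier scheme can be used in practice, the sketch below sorts accession strings into original submissions and reanalyses. It assumes that original datasets carry the prefix PXD followed by six digits and that reanalyses add a leading R; the function name classify_px_id is hypothetical, and the accessions are used only to show the format.

```python
import re

# Assumed accession pattern (not taken from the article): an optional "R"
# prefix for reanalyses, the literal "PXD", then exactly six digits.
PX_ID = re.compile(r"^(R?)PXD(\d{6})$")


def classify_px_id(identifier: str) -> str:
    """Return 'original', 'reanalysis', or 'invalid' for an accession string."""
    match = PX_ID.match(identifier.strip())
    if match is None:
        return "invalid"
    # Group 1 is "R" for a reanalysis and empty for an original submission.
    return "reanalysis" if match.group(1) else "original"


if __name__ == "__main__":
    for example in ("PXD000001", "RPXD000001", "PX000001"):
        print(example, "->", classify_px_id(example))
```

A check like this is only a format filter; whether an accession actually exists would still have to be confirmed against ProteomeCentral.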
For users, ProteomeCentral integrates metadata from all member repositories for unified search and retrieval, while Universal Spectrum Identifiers (USIs) allow for the precise identification and visualization of individual spectra.
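To make the idea of a spectrum-level identifier concrete, here is a minimal sketch that splits a USI into its parts. It assumes the colon-delimited layout mzspec:collection:run:index type:index, with an optional trailing interpretation, and it assumes the run name itself contains no colons. The USI type, the parse_usi function, and the sample string are illustrative and do not refer to a real spectrum.

```python
from typing import NamedTuple, Optional


class USI(NamedTuple):
    collection: str                # dataset accession, e.g. a PXD identifier
    ms_run: str                    # name of the MS run file (assumed colon-free)
    index_type: str                # how the spectrum is addressed, e.g. "scan"
    index: str                     # spectrum index within the run
    interpretation: Optional[str]  # optional peptide/charge annotation


def parse_usi(usi: str) -> USI:
    """Split a USI string into its components under the assumptions above."""
    # maxsplit=5 keeps any colons inside the interpretation intact.
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a recognisable USI: {usi!r}")
    interpretation = parts[5] if len(parts) == 6 else None
    return USI(parts[1], parts[2], parts[3], parts[4], interpretation)


if __name__ == "__main__":
    # Made-up example string, used only to show the layout.
    print(parse_usi("mzspec:PXD000001:example_run:scan:1234:PEPTIDEK/2"))
```

Real-world run names can be more complicated than this sketch allows, so a production parser should follow the PSI specification rather than a simple split.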
Growth and Submission Statistics
The data illustrates significant and accelerating adoption:
- Total submissions by June 2025: 64,330 datasets.
- Publicly accessible datasets: 44,248 (69% of total).
- Recent surge: 47% of all datasets were submitted in the three years preceding June 2025.
- Monthly volume: In June 2025 alone, 1,156 new datasets were submitted.
The PRIDE repository is the dominant submission channel, accounting for 77% of all datasets. The remainder comes from iProX (11%), MassIVE (7.4%), and jPOST (3.8%), with small contributions from Panorama Public and PeptideAtlas. Researchers from over 80 countries have contributed data, demonstrating global engagement.
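The figures above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the public share from the raw counts, estimates how many datasets the 47% recent share corresponds to, and derives the combined share left for Panorama Public and PeptideAtlas. The derived values are approximations, since the reported percentages are rounded.

```python
# Counts and percentages as reported in the article (June 2025).
total_datasets = 64_330
public_datasets = 44_248
repository_share = {"PRIDE": 77.0, "iProX": 11.0, "MassIVE": 7.4, "jPOST": 3.8}

# Public share recomputed from the raw counts (about 68.8%, reported as 69%).
public_share = public_datasets / total_datasets * 100

# Approximate number of datasets submitted in the preceding three years (47%).
recent_datasets = round(total_datasets * 0.47)

# Share left over for the repositories without a stated percentage.
remaining_share = 100 - sum(repository_share.values())

if __name__ == "__main__":
    print(f"Public share: {public_share:.1f}%")
    print(f"Datasets from the last three years: about {recent_datasets:,}")
    print(f"Panorama Public + PeptideAtlas: about {remaining_share:.1f}%")
```

Under these figures, roughly 30,000 datasets arrived in the last three years, and the two smallest repositories together account for under 1% of submissions.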
Data Reuse and AI Applications
Public datasets are actively reused to drive new biological discoveries, such as validating protein sequences and identifying post-translational modifications.
Integration with the UniProt Knowledgebase (UniProtKB) has been particularly impactful, helping to map more than 93% of the human proteome.
Resources such as MassIVE.quant and quantms make large-scale quantitative data available and support reproducible reanalysis. Multi-omics integration, in turn, is facilitated by resources such as the Omics Discovery Index (OmicsDI) and MGnify.
The accumulation of high-quality public data is now a critical fuel for artificial intelligence and machine learning in proteomics. Tools such as MassIVE-Knowledge-Base (MassIVE-KB) and ProteomicsML leverage these datasets to develop predictive models for peptide identification, fragmentation patterns, and protein quantification.
Challenges and Future Directions
The consortium faces several evolving challenges that will shape its future:
- Data Privacy: Regulations like GDPR and HIPAA necessitate more sophisticated controlled-access systems and repository capabilities for handling sensitive human data.
- Emerging Technologies: Proteomics platforms that do not rely on mass spectrometry, such as affinity-based assays from SomaLogic and Olink, are gaining ground. Incorporating data from these diverse technologies will require new resources and standards.
- Scalability: Sustaining growth demands ongoing improvements to infrastructure and data management practices.
The authors conclude that future progress for the field is contingent on successfully addressing these intertwined issues of data privacy, scalability, and the integration of emerging technological platforms.