Evo 2 DNA Foundation Model Published in Nature, Advancing Generative Biology

Evo 2 DNA Foundation Model: A New Era in Generative Biology

The Evo 2 DNA foundation model has been published in the journal Nature, following its preprint release in February 2025. Trained on the DNA of more than 100,000 species from across the tree of life, the model can identify patterns in gene sequences across diverse organisms.

Evo 2 can identify disease-causing mutations in human genes and can design new genomes comparable in length to those of simple bacteria.

The model was developed by scientists from Arc Institute and NVIDIA, with contributions from researchers at Stanford University, UC Berkeley, and UC San Francisco.

Open Access for Accelerated Scientific Research

The Evo team has prioritized accessibility in order to speed up scientific discovery. The model's code is publicly available on Arc's GitHub and is integrated into the NVIDIA BioNeMo framework, part of a collaboration between Arc Institute and NVIDIA intended to make the model usable by researchers worldwide.

Arc Institute also partnered with AI research lab Goodfire to develop a mechanistic interpretability visualizer. The tool reveals the specific biological features and patterns that the model learns during training.

The Evo team has made the model's training data, training and inference code, and model weights publicly accessible, making Evo 2 the largest-scale fully open-source AI model to date.

Unprecedented Scale and Training

Evo 2 builds on its predecessor, Evo 1, which was trained exclusively on the genomes of single-celled organisms, and is the largest artificial intelligence model in biology. It was trained on more than 9.3 trillion nucleotides drawn from over 128,000 whole genomes, supplemented by metagenomic data.

The dataset spans bacteria, archaea, phages, humans, plants, and a variety of other single-celled and multicellular eukaryotic species. This breadth gives Evo 2 a view of genetic information that covers every domain of life.

Transformative Impact and Potential Applications

Patrick Hsu, co-senior author and Arc Institute Co-Founder, said that Evo 1 and Evo 2 mark "a pivotal moment in generative biology, allowing machines to interpret and create nucleotide-based information." He emphasized that Evo 2's general understanding of life applies directly to tasks such as predicting disease-causing mutations and designing artificial life. Brian Hie, another co-senior author, noted that "evolutionary patterns encoded in DNA and RNA contain crucial signals about molecular functions and interactions," signals that Evo 2 is designed to detect.

Advanced Architecture for Deep Genomic Understanding

Evo 2 was trained over several months on the NVIDIA DGX Cloud AI platform via AWS, using more than 2,000 NVIDIA H100 GPUs. The model can process genetic sequences of up to 1 million nucleotides at once, which lets it relate distant regions of a genome to one another. A new AI architecture, StripedHyena 2, made this possible: it allowed Evo 2 to be trained on 30 times more data and to reason over 8 times more nucleotides at a time than its predecessor, Evo 1.
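As a back-of-the-envelope check, the two ratios quoted above can be turned into rough Evo 1 figures. The short Python sketch below uses only numbers stated in this article; the resulting Evo 1 values are implied estimates, not figures reported here, and the variable names are illustrative.

```python
# Figures quoted in this article for Evo 2 (both are stated as lower/upper bounds,
# so everything derived from them is approximate).
EVO2_TRAINING_NUCLEOTIDES = 9.3e12   # "more than 9.3 trillion nucleotides"
EVO2_CONTEXT_NUCLEOTIDES = 1_000_000  # "up to 1 million nucleotides" at once

# Ratios quoted for Evo 2 relative to Evo 1.
DATA_RATIO = 30     # "30 times more data"
CONTEXT_RATIO = 8   # "8 times more nucleotides"

# Implied Evo 1 figures (an inference from the ratios, not a reported number).
implied_evo1_training = EVO2_TRAINING_NUCLEOTIDES / DATA_RATIO
implied_evo1_context = EVO2_CONTEXT_NUCLEOTIDES / CONTEXT_RATIO

print(f"Implied Evo 1 training data: ~{implied_evo1_training:.2e} nucleotides")
print(f"Implied Evo 1 context length: ~{implied_evo1_context:,.0f} nucleotides")
```

Read this way, the ratios imply that Evo 1 saw on the order of a few hundred billion nucleotides and reasoned over windows of roughly 125,000 nucleotides.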

The model has already shown that it can identify genetic changes that affect protein function and organism fitness. For example, in tests on variants of the breast cancer-associated gene BRCA1, Evo 2 achieved over 90% accuracy in distinguishing benign mutations from potentially pathogenic ones.
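The article does not describe how Evo 2 scores a variant. One common approach with sequence models, assumed here purely for illustration, is to compare how likely the model finds the mutated sequence versus the reference sequence. The Python sketch below shows that idea with a tiny Markov model standing in for a genomic language model; `ToyMarkovModel`, `apply_snv`, and `delta_log_likelihood` are hypothetical names, and none of this is Evo 2's actual code or API.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ToyMarkovModel:
    """Order-k Markov model over nucleotides (hypothetical stand-in, not Evo 2)."""

    def __init__(self, k=2):
        self.k = k
        # counts[context][next_base] = occurrences seen during fitting
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq):
        """Sum of log P(base | previous k bases), with add-one smoothing."""
        total = 0.0
        for i in range(self.k, len(seq)):
            context_counts = self.counts.get(seq[i - self.k:i], {})
            seen = sum(context_counts.values())
            prob = (context_counts.get(seq[i], 0) + 1) / (seen + len(ALPHABET))
            total += math.log(prob)
        return total


def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based index pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def delta_log_likelihood(model, ref_seq, pos, alt):
    """Variant score: log-likelihood of the mutated sequence minus the reference."""
    mutated = apply_snv(ref_seq, pos, alt)
    return model.log_likelihood(mutated) - model.log_likelihood(ref_seq)


# Toy "training data": a repetitive motif the stand-in model can learn.
model = ToyMarkovModel(k=2).fit(["ATGATGATGATGATGATG"] * 5)
reference = "ATGATGATGATG"

# A substitution that breaks the learned motif should score below zero.
score = delta_log_likelihood(model, reference, pos=5, alt="C")
print(f"delta log-likelihood: {score:.3f}")
```

In a scheme like this, a strongly negative score means the model finds the mutated sequence much less plausible than the reference. Turning such scores into benign or pathogenic calls would still require a decision threshold or a classifier evaluated against labeled variants.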

Such insights could significantly reduce the time and resources typically required for cell or animal experiments, thereby accelerating the discovery of genetic causes for human diseases and speeding up drug development.

Future Directions and Responsible Innovation

Since the preprint release, researchers have applied Evo 2 to a range of scientific problems, including predicting genetic disease risk in Alzheimer's patients and evaluating variant effects in domesticated animals. Arc researchers have also used Evo 2 to design functional synthetic bacteriophages, which points to potential uses against antibiotic-resistant bacteria.

Hani Goodarzi, a co-author, sees Evo 2 as a means of engineering new biological tools or treatments. One example is designing genetic elements that activate gene therapies only in specific cell types, which could lead to more targeted treatments with fewer side effects. Dave Burke, Arc's Chief Technology Officer, likens Evo 2 to an "operating system kernel" that supports a wide array of applications, from predicting the effects of single DNA mutations to designing genetic elements that behave differently in different cell types.

To address ethical and safety concerns, the scientists excluded pathogens that infect humans and other complex organisms from Evo 2's core dataset. They also implemented measures to prevent the model from generating informative responses to queries about these pathogens. Tina Hernandez-Boussard and members of her lab assisted with the responsible development and deployment of the technology.