Evo 2: A Foundational Biological Sequence Model

Evo 2 is a biological sequence model that learns the likelihood of sequences across evolutionary datasets. This capability allows it to perform zero-shot prediction of functional importance without task-specific fine-tuning. The model assigns a likelihood to each sequence, and mutations that reduce this likelihood are predicted to be deleterious. Evo 2 learns across DNA, RNA, and protein modalities and across all three domains of life.
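
The scoring rule described above can be sketched as a difference of log-likelihoods between a mutant and its reference sequence. The snippet below is a minimal illustration, not Evo 2 itself: `make_toy_scorer` is a hypothetical stand-in (a small Markov model with add-one smoothing) for whatever function returns a sequence log-likelihood, and the training and reference sequences are made up.

```python
import math
from collections import Counter
from typing import Callable, Iterable


def make_toy_scorer(training_seqs: Iterable[str], k: int = 2,
                    alphabet: str = "ACGT") -> Callable[[str], float]:
    """Build a log-likelihood function from an order-k Markov model.

    Hypothetical stand-in for a real sequence model; it only exists so the
    example runs without any model weights.
    """
    context_counts: Counter = Counter()
    kmer_counts: Counter = Counter()
    for seq in training_seqs:
        for i in range(k, len(seq)):
            context_counts[seq[i - k:i]] += 1
            kmer_counts[seq[i - k:i + 1]] += 1

    def log_likelihood(seq: str) -> float:
        total = 0.0
        for i in range(k, len(seq)):
            # Add-one smoothing keeps unseen k-mers at a finite log-probability.
            numerator = kmer_counts[seq[i - k:i + 1]] + 1
            denominator = context_counts[seq[i - k:i]] + len(alphabet)
            total += math.log(numerator / denominator)
        return total

    return log_likelihood


def delta_log_likelihood(log_likelihood: Callable[[str], float],
                         reference: str, position: int, alt: str) -> float:
    """Score a substitution as log P(mutant) - log P(reference)."""
    if reference[position] == alt:
        raise ValueError("alt base equals the reference base")
    mutant = reference[:position] + alt + reference[position + 1:]
    return log_likelihood(mutant) - log_likelihood(reference)


# Made-up sequences, used only to exercise the functions.
scorer = make_toy_scorer(["ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"] * 3)
reference = "ATGGCCATTGTAATGGGCCGC"
delta = delta_log_likelihood(scorer, reference, position=1, alt="C")
print(f"delta log-likelihood: {delta:.3f}")
```

Under this convention a negative delta means the mutant is less likely than the reference, which is the direction the text associates with deleterious effects.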

Mutational Effect Prediction

Evo 2 was assessed for its ability to predict mutational effects on protein, RNA, and organismal fitness. Single nucleotide variants (SNVs) introduced around protein-coding gene start codons resulted in strong likelihood changes, particularly within start codons. A three-base periodicity was observed, reflecting triplet codons, with lower impact at wobble positions. Patterns upstream of the coding DNA sequence (CDS) were consistent with known translation initiation sequences, such as the Shine–Dalgarno sequence for prokaryotes and the Kozak sequence for eukaryotes. Similar patterns were observed for SNVs around stop codons.
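
A scan of this kind can be sketched by enumerating every SNV in a window around the start codon and averaging the likelihood change at each offset. The code below is an illustration under stated assumptions: `toy_log_likelihood` is a hypothetical scorer that only checks whether the start codon is intact, the sequence is made up, and the window size is arbitrary. A real analysis would substitute a model-derived log-likelihood.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

BASES = "ACGT"


def enumerate_snvs(seq: str, start: int, end: int) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, alt_base, mutant_sequence) for every SNV in [start, end)."""
    for pos in range(start, end):
        for alt in BASES:
            if alt != seq[pos]:
                yield pos, alt, seq[:pos] + alt + seq[pos + 1:]


def codon_phase(pos: int, cds_start: int) -> Optional[int]:
    """Return 0, 1, or 2 inside the CDS (2 is the wobble position), None upstream."""
    if pos < cds_start:
        return None
    return (pos - cds_start) % 3


def mean_delta_by_offset(seq: str, cds_start: int, flank: int,
                         log_likelihood: Callable[[str], float]) -> Dict[int, float]:
    """Average delta log-likelihood of all SNVs at each offset from the start codon."""
    ref_score = log_likelihood(seq)
    start = max(0, cds_start - flank)
    end = min(len(seq), cds_start + flank)
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for pos, _alt, mutant in enumerate_snvs(seq, start, end):
        offset = pos - cds_start
        totals[offset] = totals.get(offset, 0.0) + log_likelihood(mutant) - ref_score
        counts[offset] = counts.get(offset, 0) + 1
    return {offset: totals[offset] / counts[offset] for offset in totals}


# Made-up sequence: 11 upstream bases followed by a short CDS starting with ATG.
SEQ = "AGGAGGTAACCATGGCTAAAGGT"
CDS_START = 11


def toy_log_likelihood(seq: str) -> float:
    """Hypothetical scorer: penalize any sequence whose start codon is disrupted."""
    return 0.0 if seq[CDS_START:CDS_START + 3] == "ATG" else -5.0


profile = mean_delta_by_offset(SEQ, CDS_START, flank=6,
                               log_likelihood=toy_log_likelihood)
for offset in sorted(profile):
    print(offset, codon_phase(CDS_START + offset, CDS_START), profile[offset])
```

Grouping the resulting profile by `codon_phase` is one way to look for the three-base periodicity and the weaker effect at wobble positions that the text describes.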

Mutation Types and Genetic Codes

Evo 2 demonstrated changes in model likelihoods consistent with biological constraints across various noncoding and coding sequences in prokaryotic and eukaryotic species. Non-synonymous mutations, premature stop codons, and frameshift mutations caused larger likelihood changes than synonymous mutations. Deletions in transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) had greater effects than those in intergenic regions. The 40B model exhibited higher sensitivity than the 7B model to deletions in microRNA (miRNA) and small nucleolar RNA (snoRNA) sequences. Evo 2 also assigned lower likelihoods to less efficiently translated codons.

The model also learned differences in stop codon usage among species with distinct genetic codes, including the standard, mycoplasma, and ciliate codes. It determined the appropriate genetic code based on sequence context.
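
To make the mutation categories and the role of the genetic code concrete, the sketch below classifies a coding SNV under three translation tables. The tables follow the NCBI numbering (1: standard; 4: Mycoplasma/Spiroplasma, where TGA encodes tryptophan; 6: ciliate nuclear, where TAA and TAG encode glutamine). The function names and the example CDS are illustrative, and the code assumes an in-frame CDS containing only A, C, G, and T.

```python
from typing import Dict

_BASES = "TCAG"
_STANDARD_AA = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
_CODONS = [a + b + c for a in _BASES for b in _BASES for c in _BASES]

# NCBI translation table 1 (standard code); '*' marks a stop codon.
STANDARD: Dict[str, str] = dict(zip(_CODONS, _STANDARD_AA))
# Table 4: TGA is read as tryptophan instead of stop.
MYCOPLASMA: Dict[str, str] = {**STANDARD, "TGA": "W"}
# Table 6: TAA and TAG are read as glutamine instead of stop.
CILIATE: Dict[str, str] = {**STANDARD, "TAA": "Q", "TAG": "Q"}


def classify_snv(cds: str, position: int, alt: str, table: Dict[str, str]) -> str:
    """Label a substitution as synonymous, nonsynonymous, or premature_stop."""
    phase = position % 3
    codon_start = position - phase
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:phase] + alt + ref_codon[phase + 1:]
    ref_aa, alt_aa = table[ref_codon], table[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "premature_stop"
    return "nonsynonymous"  # includes loss of an existing stop codon


def is_frameshift(indel_length: int) -> bool:
    """An insertion or deletion shifts the frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


cds = "ATGTGGAAATAA"  # made-up CDS: Met-Trp-Lys-stop under the standard code
for name, table in [("standard", STANDARD), ("mycoplasma", MYCOPLASMA),
                    ("ciliate", CILIATE)]:
    # TGG -> TGA at position 5, and AAA -> TAA at position 6.
    print(name, classify_snv(cds, 5, "A", table), classify_snv(cds, 6, "T", table))
```

The same nucleotide change is a premature stop under one code and a synonymous or nonsynonymous change under another, which is why a model has to infer the code from sequence context before its likelihoods can reflect the right constraint.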

Correlation with Experimental Fitness

Evo 2 sequence likelihoods correlated with diverse definitions of fitness across multiple prokaryotic protein datasets, eukaryotic protein datasets, and datasets of rRNAs, tRNAs, and ribozymes. The model's performance on these fitness prediction benchmarks was competitive with some protein and RNA language models but underperformed state-of-the-art models in protein deep mutational scanning (DMS).
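
Agreement between likelihood-based scores and measured fitness is commonly summarized with a rank correlation such as Spearman's rho. The source does not name the statistic here, so treat that choice as an assumption. The sketch below implements it in plain Python, and the score and fitness values are invented purely to exercise the function.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values for five variants: model delta log-likelihoods and assay fitness.
delta_scores = [-4.1, -0.2, -2.7, -0.9, -3.3]
measured_fitness = [0.05, 0.95, 0.55, 0.40, 0.30]
rho = spearman(delta_scores, measured_fitness)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based statistic is a natural fit because likelihood differences and assay readouts live on different scales, and only their ordering is expected to agree.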

Exon-Intron Architecture Prediction

Lightweight models trained on Evo 2 7B base embeddings were developed as single-nucleotide resolution exon classifiers. These classifiers achieved areas under the receiver operating characteristic curve (AUROCs) ranging from 0.91 to 0.99 across eight held-out species. This performance exceeded models trained on embeddings from other genomic language models (Nucleotide Transformer, Evo 1), conservation metrics, ab initio AUGUSTUS, and SegmentNT on most tested species, suggesting utility for functional annotation.
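
The general idea of a lightweight per-nucleotide classifier on frozen embeddings can be sketched with a logistic-regression probe evaluated by AUROC. Everything below is an assumption made for illustration: the embeddings are synthetic Gaussian vectors rather than Evo 2 outputs, the probe is plain logistic regression (the source does not specify the classifier architecture), and the dimensions and hyperparameters are arbitrary.

```python
import math
import random
from typing import List, Sequence, Tuple


def make_synthetic_embeddings(n: int, dim: int,
                              rng: random.Random) -> Tuple[List[List[float]], List[int]]:
    """Draw one vector per nucleotide; exon positions shift the first two dimensions."""
    xs, ys = [], []
    for _ in range(n):
        is_exon = rng.random() < 0.3
        shift = 1.0 if is_exon else -1.0
        xs.append([rng.gauss(shift if d < 2 else 0.0, 1.0) for d in range(dim)])
        ys.append(int(is_exon))
    return xs, ys


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def logit(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_logistic_probe(xs, ys, epochs: int = 200, lr: float = 0.1):
    """Fit weights and bias by full-batch gradient descent on the logistic loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(logit(w, b, x)) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
train_x, train_y = make_synthetic_embeddings(300, 8, rng)
test_x, test_y = make_synthetic_embeddings(300, 8, rng)  # plays the held-out role
weights, bias = train_logistic_probe(train_x, train_y)
held_out_auroc = auroc([logit(weights, bias, x) for x in test_x], test_y)
print(f"held-out AUROC on synthetic data: {held_out_auroc:.3f}")
```

The figure printed here says nothing about Evo 2. It only shows how a per-position probe and its AUROC would be computed once real embeddings and exon labels are substituted.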

Gene Essentiality Prediction

Using zero-shot likelihoods to score the effects of premature stop codon insertions, Evo 2 models predicted gene essentiality across bacterial, archaeal, and phage genomes. The models performed similarly to Evo 1 and better than other zero-shot methods. For human gene essentiality, Evo 2 40B achieved an AUROC of 0.66 and an AUPRC of 0.15, outperforming other genomic language models, although overall predictive performance remained modest.
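
The stop-codon scoring idea can be sketched as follows: insert in-frame stop codons into a coding sequence, measure the drop in log-likelihood, and evaluate the resulting gene ranking against essentiality labels. The insertion offset, the number of stops, the choice of TAA, and the `toy_log_likelihood` scorer are all hypothetical choices for illustration, and the scores and labels in the example are invented.

```python
from typing import Callable, Sequence

STOP_CODONS = {"TAA", "TAG", "TGA"}


def insert_premature_stops(cds: str, codon_offset: int, n_stops: int = 1,
                           stop_codon: str = "TAA") -> str:
    """Insert n_stops in-frame stop codons after the first codon_offset codons."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    cut = 3 * codon_offset
    if not 0 < cut < len(cds):
        raise ValueError("insertion point must fall inside the CDS")
    return cds[:cut] + stop_codon * n_stops + cds[cut:]


def essentiality_score(cds: str, log_likelihood: Callable[[str], float],
                       codon_offset: int = 2, n_stops: int = 1) -> float:
    """Drop in log-likelihood after the insertion; larger means predicted more essential."""
    mutant = insert_premature_stops(cds, codon_offset, n_stops)
    return log_likelihood(cds) - log_likelihood(mutant)


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision, a common estimate of the area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return total / hits


def toy_log_likelihood(seq: str) -> float:
    """Hypothetical scorer: penalize in-frame stops before the final codon."""
    internal = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    return -5.0 * sum(codon in STOP_CODONS for codon in internal)


cds = "ATGGCTAAAGGTTAA"  # made-up CDS
print(insert_premature_stops(cds, codon_offset=2))
print(essentiality_score(cds, toy_log_likelihood))

# Invented scores and essentiality labels for five genes.
scores = [3.2, 0.4, 2.1, 0.1, 1.5]
labels = [1, 0, 0, 0, 1]
print(f"average precision = {average_precision(scores, labels):.3f}")
```

Because a random ranking has an expected average precision roughly equal to the fraction of positive labels, an AUPRC should be read against the prevalence of essential genes in the benchmark rather than against 0.5.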

Conclusion

These results indicate that Evo 2 captures biological information across various modalities and domains of life. The 7B and 40B models expand predictive capabilities without compromising insights gained from prokaryotic data. Both zero-shot likelihoods and simple classifiers trained on Evo 2 embeddings provide a foundation for downstream applications in computational biology.
