Database Curation
Prokaryotic Database
Approximately 75 million sequences were curated from 47,545 fully sequenced prokaryotic genomes from NCBI GenBank (November 2023). This was supplemented with 441,150 protein sequences from 63 Asgard metagenomes.
- Protein sequences were derived from GenBank annotations or Prodigal v2.6.3.
- Soft-core pangenomes were constructed for 26 taxonomic groups (primarily 'class' rank) to identify sequences widely present across prokaryotic families and to minimize the contribution of horizontal transfers postdating the last eukaryotic common ancestor (LECA).
- Sequences were clustered using MMseqs2; clusters were rejected if they did not contain sequences from at least 50% of bacterial species or 20% of eukaryotic sequences.
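The soft-core criterion can be illustrated with a minimal sketch. It assumes each cluster is summarised by the species that contribute its sequences, and it only demonstrates the 50% threshold quoted above; the function names and the toy species identifiers are hypothetical and are not taken from the original pipeline.

```python
from typing import Iterable, Set


def species_fraction(cluster_species: Iterable[str], group_species: Set[str]) -> float:
    """Fraction of a group's species that contribute at least one sequence to a cluster."""
    if not group_species:
        return 0.0
    present = set(cluster_species) & group_species
    return len(present) / len(group_species)


def is_soft_core(cluster_species: Iterable[str], group_species: Set[str],
                 min_fraction: float = 0.5) -> bool:
    """Keep a cluster only if enough of the group's species are represented."""
    return species_fraction(cluster_species, group_species) >= min_fraction


# Toy taxonomic group with four species (hypothetical identifiers).
group = {"sp1", "sp2", "sp3", "sp4"}
kept = is_soft_core(["sp1", "sp1", "sp3"], group)   # two of four species present
rejected = is_soft_core(["sp2", "other"], group)    # one of four species present
```

Repeated sequences from the same species count once, so paralogue-rich genomes do not inflate the fraction.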
Eukaryotic Database
The eukaryotic database was built from 72 curated genomes from NCBI GenBank and augmented with EukProt v.3, yielding Euk72Ep with 30.3 million sequences.
- The Euk72Ep database was screened for prokaryotic contamination using candidate HMM databases and HHblits; sequences whose top hits were prokaryotic HMMs were removed.
Curation of Taxonomic Labels
Taxonomic labels were curated for both the eukaryotic (Euk72Ep) and prokaryotic (Prok2311As) databases to balance specificity against accuracy.
- Species in EukProt were manually assigned to their closest relatives in NCBI taxonomy.
- Species from Prok2311As and Euk72Ep were assigned 'class'-rank labels based on the NCBI Taxonomy of November 2023 using ete3 v.3.1.3 and custom scripts.
- These identifiers were further mapped to a curated set of 45 eukaryotic and 26 prokaryotic taxonomic superclasses (e.g., Metazoa, TACK archaea).
- All candidate Asgard sequences were assigned to the 'Asgard' class label.
- Deltaproteobacteria were reclassified into Myxococcota, Desulfobacteriota, Bdellovibrionota, and SAR324; unmappable species were retained as Deltaproteobacteria.
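The labelling rules above can be sketched as a small function. This is an illustrative reading rather than the authors' custom scripts: the lineage is assumed to be a plain rank-to-name mapping (as could be derived from NCBI Taxonomy), and `is_asgard` and `delta_reassignment` are hypothetical inputs standing in for the Asgard candidate flag and the curated Deltaproteobacteria remapping.

```python
from typing import Dict, Optional

# Phyla into which Deltaproteobacteria were redistributed, as listed in the text.
DELTA_SPLIT = {"Myxococcota", "Desulfobacteriota", "Bdellovibrionota", "SAR324"}


def assign_label(lineage: Dict[str, str],
                 is_asgard: bool = False,
                 delta_reassignment: Optional[str] = None) -> Optional[str]:
    """Return a class-level label for one species, or None if no class is known."""
    if is_asgard:
        # All candidate Asgard sequences share a single label.
        return "Asgard"
    label = lineage.get("class")
    if label == "Deltaproteobacteria" and delta_reassignment in DELTA_SPLIT:
        # Mappable species move to one of the successor groups.
        return delta_reassignment
    # Unmappable Deltaproteobacteria (and everything else) keep their class label.
    return label


myxo = assign_label({"superkingdom": "Bacteria", "class": "Deltaproteobacteria"},
                    delta_reassignment="Myxococcota")
unmapped = assign_label({"superkingdom": "Bacteria", "class": "Deltaproteobacteria"})
asgard = assign_label({"superkingdom": "Archaea"}, is_asgard=True)
```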
Sequence Clustering and Profile Database Generation
An unsupervised, cascaded sequence-profile clustering pipeline using MMseqs2 was implemented to transform sequence databases into profile databases.
- Initial clustering at 90% sequence identity and 80% pairwise coverage yielded a non-redundant set of 6.3 million prokaryotic and 25.1 million eukaryotic sequences.
- These sequences were then iteratively collapsed into profiles and consensus sequence pairs until cluster sizes converged.
- Final clustered databases contained 14.1 million eukaryotic and 91,000 prokaryotic clusters.
- HMM profiles were constructed using a mixed MSA strategy involving FAMSA, FastTree (for outlier removal), muscle5 (re-alignment), and trimming based on Shannon information content.
- The final HH-suite databases comprised 480,000 prokaryotic and 1.6 million eukaryotic profiles.
Cluster Annealing
To ensure clustering results were robust, a cluster annealing step was developed to redefine clusters based on monophyletic groups in phylogenetic trees.
- HMM profiles were calculated for non-singleton clusters, and all-versus-all HMM-HMM searches were performed using HHblits (80% probability, 50% pairwise profile coverage).
- Resulting hits were clustered using greedy set clustering to form superclusters.
- Superclusters were reduced using a 'prune-and-align' approach, and master trees were constructed with FastTree.
- Tree partitioning was conducted by embedding pairwise leaf distances into two dimensions using UMAP and clustering with HDBSCAN to identify well-separated monophyletic sets.
- This process yielded 1.5 million annealed eukaryotic clusters and 280,000 annealed prokaryotic clusters.
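The supercluster step can be illustrated with a generic greedy set clustering sketch. It assumes the HHblits hits have already been reduced to a mapping from each profile to the set of profiles it hits; the function name and the toy identifiers are hypothetical, and the actual pipeline may break ties or treat hit direction differently.

```python
from typing import Dict, List, Set


def greedy_set_cluster(hits: Dict[str, Set[str]]) -> List[Set[str]]:
    """Repeatedly pick the profile covering the most unassigned profiles."""
    unassigned: Set[str] = set(hits)
    for targets in hits.values():
        unassigned |= targets

    clusters: List[Set[str]] = []
    while unassigned:
        def coverage(node: str) -> int:
            # A profile always covers itself plus its still-unassigned hits.
            return len((hits.get(node, set()) | {node}) & unassigned)

        # Sorting first makes tie-breaking deterministic (alphabetical).
        representative = max(sorted(unassigned), key=coverage)
        members = (hits.get(representative, set()) | {representative}) & unassigned
        clusters.append(members)
        unassigned -= members
    return clusters


toy_hits = {"A": {"B", "C"}, "B": {"A"}, "D": {"E"}, "F": set()}
superclusters = greedy_set_cluster(toy_hits)
```

In this toy input, profile A absorbs B and C first because it covers the most profiles; D then takes E, and F remains a singleton.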
Search for Prokaryotic Homologues of Eukaryotic Proteins
To identify shared protein families, a search was conducted for prokaryotic homologues of eukaryotic proteins.
- Eukaryotic HMM profiles with fewer than ten sequences or with a lowest common ancestor (LCA) taxonomic rank below 'Superkingdom' were excluded, retaining 142,000 profiles.
- The Euk72Ep HMM database was queried against Prok2311As HMM profiles using HHblits (80% probability, 80% coverage of pairwise profile length).
- This identified 20,700 eukaryotic query profiles targeting 8,300 unique prokaryotic profiles, containing 5.7 million and 1.9 million sequences, respectively.
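The profile filter can be sketched by computing the LCA as the longest shared prefix of the member lineages. The sketch assumes each lineage is a list that starts at the superkingdom, so an LCA of rank 'Superkingdom' corresponds to a shared prefix of at most one element; the function names and toy lineages are hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Longest shared prefix of root-to-leaf lineages."""
    if not lineages:
        return []
    shared = list(lineages[0])
    for lineage in lineages[1:]:
        n = 0
        for a, b in zip(shared, lineage):
            if a != b:
                break
            n += 1
        shared = shared[:n]
    return shared


def keep_profile(lineages: Sequence[Sequence[str]], min_sequences: int = 10) -> bool:
    """Keep profiles with enough sequences whose LCA is not below the superkingdom."""
    if len(lineages) < min_sequences:
        return False
    return len(lowest_common_ancestor(lineages)) <= 1


opisthokont = ["Eukaryota", "Opisthokonta", "Metazoa"]
plant = ["Eukaryota", "Viridiplantae", "Streptophyta"]

broad = keep_profile([opisthokont] * 5 + [plant] * 5)    # LCA is Eukaryota
narrow = keep_profile([opisthokont] * 10)                # LCA is below superkingdom
small = keep_profile([opisthokont] * 4 + [plant] * 5)    # only nine sequences
```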
EPOC MSA Construction
Eukaryotic and Prokaryotic Orthologous Clusters (EPOCs) were formed by combining eukaryotic clusters (>=10 sequences, LCA 'Superkingdom') with homologous prokaryotic clusters.
- Robust data subsampling was performed using a modified 'prune-and-align' strategy that accounted for taxonomic distribution, limiting eukaryotic and prokaryotic sequences to a maximum of 30 and 70, respectively, per EPOC (total 100 sequences).
- EPOCs were aligned using muscle5 with the -diversified option, and alignments were trimmed to keep only columns with more than 0.15 bits of Shannon information content.
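The column-trimming step can be sketched as follows. The text does not state how information content was computed, so this sketch makes two assumptions: information is measured against a uniform 20-letter amino acid background (log2(20) minus the column entropy), and it is scaled by the fraction of non-gap residues. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence

MAX_BITS = math.log2(20)  # maximum information for a 20-letter alphabet


def column_information(column: str) -> float:
    """Shannon information content (bits) of one alignment column."""
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    occupancy = n / len(column)  # assumption: down-weight gappy columns
    return (MAX_BITS - entropy) * occupancy


def trim_alignment(seqs: Sequence[str], min_bits: float = 0.15) -> List[str]:
    """Keep only columns carrying more than min_bits of information."""
    if not seqs:
        return []
    keep = [i for i, col in enumerate(zip(*seqs))
            if column_information("".join(col)) > min_bits]
    return ["".join(s[i] for i in keep) for s in seqs]


trimmed = trim_alignment(["AC-", "AD-", "AE-"])  # the all-gap third column is dropped
```

With small toy alignments almost every occupied column exceeds 0.15 bits; the threshold mainly matters for deep alignments where a column approaches a uniform residue distribution or is mostly gaps.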
EPOC Tree Construction and Processing
Maximum likelihood trees were created for EPOCs, followed by processing to ensure data quality and accurate clade assignment.
- Trees were built using IQ-TREE 2 with model parameters estimated by ModelFinder Plus.
- Long-stemmed clade and leaf outliers were identified and removed by statistical analysis of stem lengths.
- Trees were re-rooted using a weighted midpoint approach.
- A 'soft LCA' method was adopted for taxonomic clade assignment to balance clade purity against scope: a clade was considered valid if it represented at least three prokaryotic sequences or at least five eukaryotic sequences, in both cases with purity >0.8.
- Trees failing to meet these clade identification criteria, or showing high paraphyly (>3 valid eukaryotic clades), were discarded.
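The 'soft LCA' validity test can be sketched for a single clade. The sketch assumes purity is the fraction of leaves carrying the majority label and that the minimum size applies to the number of leaves with that label; both are interpretations of the text, and the function name is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

MIN_LEAVES = {"prokaryote": 3, "eukaryote": 5}  # thresholds quoted in the text
MIN_PURITY = 0.8


def soft_lca(leaf_labels: Sequence[str], domain: str) -> Optional[str]:
    """Return the majority label if the clade is large and pure enough, else None."""
    if not leaf_labels:
        return None
    label, count = Counter(leaf_labels).most_common(1)[0]
    purity = count / len(leaf_labels)
    if count >= MIN_LEAVES[domain] and purity > MIN_PURITY:
        return label
    return None


valid = soft_lca(["Asgard"] * 5 + ["TACK archaea"], "prokaryote")       # purity 5/6
borderline = soft_lca(["Asgard"] * 4 + ["TACK archaea"], "prokaryote")  # purity exactly 0.8
too_small = soft_lca(["Metazoa"] * 4, "eukaryote")                      # fewer than five leaves
```

Tolerating a minority of off-label leaves lets a clade keep a specific label despite a few misplaced or mislabelled sequences, rather than collapsing to a much broader strict LCA.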
Evolutionary Hypothesis Testing Using Constraint Trees
Evolutionary hypothesis testing was performed using constraint trees in IQ-TREE 2 to assess the relative probabilities of prokaryotic clades representing eukaryotic sister clades.
- For each EPOC and eukaryotic clade, constraint trees were generated, enforcing three defined clades: the eukaryotic group, a specific prokaryotic sister group, and all other prokaryotic leaves.
- Local trees were constructed from MSA slices with forced topologies and ranked using IQ-TREE 2 by their expected likelihood weight (ELW), a measure of relative confidence in each constrained topology.
- Sampling was limited to a maximum of three eukaryotic clades and the 12 closest prokaryotic clades per tree.
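To make the ELW score concrete, the sketch below computes expected likelihood weights from bootstrap log-likelihoods: within each bootstrap replicate the likelihoods of the competing trees are normalised to weights, and the weights are then averaged across replicates. IQ-TREE 2 computes this value itself; the input layout (one row per replicate, one column per constrained tree) and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def likelihood_weights(log_likelihoods: Sequence[float]) -> List[float]:
    """Normalise log-likelihoods to weights summing to one (numerically stable)."""
    best = max(log_likelihoods)
    scaled = [math.exp(value - best) for value in log_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]


def expected_likelihood_weights(bootstrap_lnl: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-replicate likelihood weights of each candidate tree."""
    n_replicates = len(bootstrap_lnl)
    n_trees = len(bootstrap_lnl[0])
    sums = [0.0] * n_trees
    for replicate in bootstrap_lnl:
        for i, weight in enumerate(likelihood_weights(replicate)):
            sums[i] += weight
    return [s / n_replicates for s in sums]


# Two constrained topologies that each win one of two replicates decisively.
split = expected_likelihood_weights([[-10.0, -1000.0], [-1000.0, -10.0]])
# One topology that is consistently ten log-likelihood units better.
dominant = expected_likelihood_weights([[-10.0, -20.0], [-12.0, -22.0]])
```

Because the weights of all candidate sister groups sum to one, an ELW close to 1 means that a single prokaryotic clade is favoured in almost every replicate.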
EPOC Annotation
Functional annotation of protein families within EPOCs was performed using KEGG release v.110.
- HMM profiles were generated for KEGG Orthologous Groups (KOGs), partitioned taxonomically for prokaryotic and eukaryotic sequences.
- Eukaryotic HMM profiles forming EPOCs were queried against the KEGG profile database using HHblits (80% probability, 70% pairwise coverage).
- 12,600 out of 13,500 EPOCs were successfully annotated.
Data Filtering and Removal of Cases of Possible Late HGT
Only a core set of EPOCs meeting specific criteria was used for analysis, to ensure reliability and to exclude potential late horizontal gene transfer (HGT).
- Criteria included: eukaryotic cluster profile identifying a KEGG target (80% probability, 70% coverage); eukaryotic clades with more than five distinct taxonomic labels, including Amorphea and Diaphoretickes; and prokaryotic sister clades with 0.4 < ELW < 0.99.
- Cases with ELW >= 0.99 were generally excluded as likely indicative of late HGT, with an exception for alphaproteobacterial associations with Oxidative phosphorylation.
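The filtering criteria can be expressed as a single predicate. This is a sketch under stated assumptions: the `SisterCandidate` record and its field names are hypothetical, the eukaryotic supergroup membership is assumed to be precomputed, and the text says high-ELW cases were only "generally" excluded, so the real procedure may include further judgement calls.

```python
from dataclasses import dataclass, replace
from typing import Set


@dataclass(frozen=True)
class SisterCandidate:
    has_kegg_hit: bool          # profile hit a KEGG target at the stated thresholds
    euk_labels: Set[str]        # distinct taxonomic labels in the eukaryotic clade
    euk_supergroups: Set[str]   # supergroups covered by those labels
    sister_label: str           # label of the prokaryotic sister clade
    kegg_pathway: str           # functional annotation of the EPOC
    elw: float                  # expected likelihood weight of the association


def passes_core_filter(c: SisterCandidate) -> bool:
    if not c.has_kegg_hit:
        return False
    if len(c.euk_labels) <= 5:  # "more than five distinct taxonomic labels"
        return False
    if not {"Amorphea", "Diaphoretickes"} <= c.euk_supergroups:
        return False
    if c.elw <= 0.4:
        return False
    if c.elw >= 0.99:
        # Likely late HGT, except for the stated alphaproteobacterial exception.
        return (c.sister_label == "Alphaproteobacteria"
                and c.kegg_pathway == "Oxidative phosphorylation")
    return True


# Toy candidate; the label names are placeholders, not the curated label set.
base = SisterCandidate(True, {"L1", "L2", "L3", "L4", "L5", "L6"},
                       {"Amorphea", "Diaphoretickes"}, "Asgard", "Ribosome", 0.7)
late_hgt = replace(base, elw=0.995)
exception = replace(base, elw=0.995, sister_label="Alphaproteobacteria",
                    kegg_pathway="Oxidative phosphorylation")
```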
Taxonomy Remapping Under GTDB Taxonomy
To ensure robustness, the prokaryotic database was reformatted under the GTDB taxonomy (v.220) at the 'phylum' level.
- Genomes present in both Prok2311 and GTDB were directly annotated.
- Remaining genomes were assigned taxonomy through consensus voting using GTDB marker genes.
- This process mapped Prok2311 data to 96 GTDB phyla, which were used for subsequent EPOC calculations. Asgard data were assigned directly to Asgardarchaeota.
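Consensus voting for genomes absent from GTDB can be sketched as a majority vote over per-marker phylum assignments. The text does not specify the voting rule, so the strict-majority threshold (`min_support`) and the function name are assumptions; markers without an assignment are ignored.

```python
from collections import Counter
from typing import Iterable, Optional


def consensus_phylum(marker_votes: Iterable[Optional[str]],
                     min_support: float = 0.5) -> Optional[str]:
    """Return the phylum backed by more than min_support of the informative markers."""
    votes = [v for v in marker_votes if v]
    if not votes:
        return None
    phylum, count = Counter(votes).most_common(1)[0]
    return phylum if count / len(votes) > min_support else None


# Toy votes from four marker genes, one of which had no hit.
assigned = consensus_phylum(["Pseudomonadota", "Pseudomonadota", None, "Bacillota"])
ambiguous = consensus_phylum(["Pseudomonadota", "Bacillota"])
```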