Patient Sample Collection and Ethical Considerations
Blood samples were collected from patients with Colorectal Cancer (CRC), Diffuse Large B-cell Lymphoma (DLBCL), Mantle Cell Lymphoma (MCL), Follicular Lymphoma (FL), transformed Follicular Lymphoma (tFL), T-cell Lymphoma (TCL), Coronary Heart Disease (CHD), or Inflammatory Bowel Disease (IBD). Samples from healthy individuals served as controls. Samples were collected at hospitals in Shanghai and Beijing.
All research was conducted in strict adherence to relevant ethical regulations, with informed consent obtained from all individuals or their legal guardians.
The study was approved by the ethics committees of Peking University Third Hospital, Ruijin Hospital, Renji Hospital, Chinese PLA General Hospital, and Peking University First Hospital.
Clinical Variable Assessment
The histological and cell-of-origin subtypes of cancers such as DLBCL, MCL, and FL were established in accordance with WHO guidelines, using a combination of microscopy, immunohistochemistry, and the Hans classifier. tFL samples represented the morphological transformation stage from FL to DLBCL, a diagnosis validated by immunohistochemical staining or fluorescence in situ hybridization.
Patient samples were collected both before and during treatment. Treatment was chosen according to cancer type; for example, DLBCL patients received R-CHOP-like therapy. Treatment response was evaluated using the Lugano 2014 criteria. Extranodal involvement sites and metabolic tumor volume were determined by PET-CT or CT scans. Colorectal precancerous lesions (CRA) were diagnosed by colonoscopy. Progression-free and overall survival were calculated from the initiation of treatment.
Laboratory Procedures
Barcoded Tn5 Preparation
The Tn5 transposase was expressed, purified, and assembled following established protocols. The pTXB1-Tn5 plasmid was transformed into BL21 (DE3) cells, protein expression was induced, and the protein was purified. Barcoded adapters were annealed with MErev oligos and then incubated with the purified Tn5 transposase to form the active transposome complex, which was stored at -20 °C until use.
Plasma Sample Collection
Blood samples were collected in K2 EDTA tubes and immediately transferred to ice. Proteinase inhibitors and sodium butyrate were then added to prevent degradation. Plasma was isolated by two-step centrifugation. The isolated plasma was either used fresh or flash-frozen and stored at -80 °C.
Spike-in Chromatin Preparation
- S2 Chromatin (Technical Control): Fixed S2 cells were sonicated to produce chromatin fragments of approximately 300 bp. These fragments were spiked into each plasma sample as a technical control for data normalization.
- CRC Tumor Chromatin & S2 Nucleosomes (Sensitivity Analysis): Chromatin derived from CRC tumors and S2 nucleosomes, both prepared by MNase digestion, were used to assess the sensitivity of the cf-EpiTracing method in detecting external chromatin signals.
cf-EpiTracing Experimental Procedure
Antibodies, such as anti-H3K4me1 and anti-H3K27ac, were conjugated to magnetic beads. These antibody-beads, along with proteinase inhibitors, sodium butyrate, and the S2 chromatin spike-in, were added to 200 µl of plasma. The mixture was rotated overnight to facilitate the capture of cf-chromatin. Following incubation, the beads were washed and resuspended in tagmentation buffer containing the prepared barcoded Tn5. The tagmentation reaction was activated at 37 °C for 30 minutes, then promptly stopped, and proteins were digested using lysis buffer and Proteinase K. PCR enrichment was subsequently performed using KAPA HiFi DNA polymerase and indexed primers. The amplified products were purified and sequenced with paired-end 150-bp reads on the DNBSEQ-T7 platform. Several key steps in this procedure were optimized for automated processing, enhancing efficiency and reproducibility.
In situ ChIP with Tumour Tissues
Tumor tissues were homogenized and then incubated with primary antibodies (anti-H3K4me3, anti-H3K9ac, anti-H3K27ac), followed by secondary antibodies and PAT-MEAB. Samples were then incubated in reaction buffer, the reaction was stopped with EDTA, and proteins were digested. PCR enrichment was performed using Nextera index primers, followed by size selection (200-1,000 bp) and sequencing on the DNBSEQ-T7 platform.
Bioinformatics and Data Analysis
Mapping, Visualization, and Peak Calling
Sequencing reads were filtered, trimmed of adapters, and mapped to both the human (hg19) and Drosophila (dm3) reference genomes using Bowtie2. Only uniquely mapped, non-duplicated reads with high mapping quality were retained for downstream analysis. Normalized coverage tracks were generated with deepTools bamCoverage and inspected in the Integrative Genomics Viewer. Heat maps and average signal curves were also generated with deepTools. Peaks (enriched regions) were called with MACS2.
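One common way to use a Drosophila spike-in for normalization is to scale each sample by its spike-in read count. The sketch below illustrates that idea only; the text does not state the exact formula used, so the scaling rule, the function name `spike_in_scale_factors`, and all read counts are assumptions for illustration.

```python
from typing import Dict


def spike_in_scale_factors(spike_counts: Dict[str, int]) -> Dict[str, float]:
    """Scale factor per sample, relative to the sample with the fewest spike-in reads.

    Assumption: the same amount of S2 chromatin was added to every sample, so a
    sample with twice as many dm3 reads is scaled down by a factor of two.
    """
    reference = min(spike_counts.values())
    return {sample: reference / count for sample, count in spike_counts.items()}


# Hypothetical read counts (not from the study).
dm3_reads = {"plasma_1": 1000, "plasma_2": 2000}
hg19_reads = {"plasma_1": 50000, "plasma_2": 80000}

factors = spike_in_scale_factors(dm3_reads)
normalized = {s: hg19_reads[s] * factors[s] for s in hg19_reads}
print(normalized)
```

A per-sample factor of this kind could also be passed to a coverage tool as a scaling factor, but whether the study did so is not stated here.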
Receiver Operating Characteristic (ROC) Curve for Accuracy
ROC curves were used to evaluate cf-EpiTracing data quality and to benchmark its performance against cfChIP. The analysis focused on peak regions of H3K4me3, H3K9ac, H3K27ac, and H3K36me3 signals. Gold-standard true positives were defined as high-confidence peaks located in promoter or gene body regions.
Integrated Analysis of Multiple Histone Modifications (ChromHMM)
ChromHMM (v.1.23) utilized a multivariate hidden Markov model to identify 18 distinct Integrative Chromatin States (ICSs). These states were defined based on the combinatorial patterns of seven core chromatin marks across various epigenomes. Each 200-bp genomic bin was discretized as either enriched or not enriched for each mark, forming the basis for state assignments.
Definition of Tissue-Specific Sites with ICSs
Tissue-specific regions, termed "tissue signatures," were defined as 200-bp genomic bins that were labeled with a given state in exactly one tissue or primary cell type with a posterior probability greater than 0.9, and with a probability less than 0.1 in all other tissues or cells.
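The signature rule above can be written as a small filter. This is a minimal sketch, assuming the posterior probabilities for one bin and one ICS are available as a dictionary keyed by tissue; the function name, tissue names, and values are hypothetical.

```python
from typing import Dict, Optional


def signature_tissue(posteriors: Dict[str, float],
                     high: float = 0.9, low: float = 0.1) -> Optional[str]:
    """Return the single tissue for which this bin is a signature, else None."""
    above = [tissue for tissue, p in posteriors.items() if p > high]
    if len(above) != 1:
        return None
    owner = above[0]
    # Every other tissue must fall below the low threshold.
    others_low = all(p < low for tissue, p in posteriors.items() if tissue != owner)
    return owner if others_low else None


# Toy posteriors for one 200-bp bin and one ICS.
bin_a = {"colon": 0.97, "heart": 0.02, "naive_B": 0.01}
bin_b = {"colon": 0.95, "heart": 0.40, "naive_B": 0.01}

print(signature_tissue(bin_a))  # colon
print(signature_tissue(bin_b))  # None
```

Bin `bin_b` is rejected because the heart posterior sits between the two thresholds, so the bin is not exclusive to colon.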
Calculation of Signals of Tissue Signatures for Plasma Samples
Genomic regions corresponding to the selected histone modifications (H3K4me3, H3K9ac, and H3K27ac) were used to establish genome-wide ICSs. The cf-chromatin signals of each plasma sample were then summarized across the 18 ICSs for 65 tissues/cells. This produced a vector of 1,170 values per plasma sample (18 ICSs × 65 tissues/cells), representing the tissue-specific signals captured in that sample.
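The 1,170-value vector follows from 18 ICSs × 65 tissues/cells. The sketch below shows one way to build it, assuming that "summarized" means summing the plasma signal over the signature bins of each (tissue, ICS) pair; that choice, the data layout, and all names are assumptions.

```python
from typing import Dict, List, Tuple

N_ICS = 18
N_TISSUES = 65


def signature_vector(bin_signal: Dict[int, float],
                     signature_bins: Dict[Tuple[str, int], List[int]],
                     tissues: List[str],
                     n_ics: int = N_ICS) -> List[float]:
    """One value per (tissue, ICS) pair, ordered tissue-major."""
    vector = []
    for tissue in tissues:
        for ics in range(1, n_ics + 1):
            bins = signature_bins.get((tissue, ics), [])
            # Assumption: summarize by summing signal over the signature bins.
            vector.append(sum(bin_signal.get(b, 0.0) for b in bins))
    return vector


# Hypothetical tissues, signature bins, and per-bin plasma signal.
tissues = [f"tissue_{i}" for i in range(N_TISSUES)]
signature_bins = {("tissue_0", 1): [10, 11], ("tissue_3", 18): [42]}
bin_signal = {10: 2.0, 11: 1.5, 42: 4.0}

vec = signature_vector(bin_signal, signature_bins, tissues)
print(len(vec))  # 1170
```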
Unsupervised Clustering Analyses
Unsupervised clustering was used to compare how well individual and combined histone modifications clustered samples. This included both tissue/primary cell samples (from ChIP-seq data) and plasma samples (from cf-EpiTracing data of DLBCL patients and healthy controls). Seurat was used for normalization and Principal Component Analysis (PCA). Samples were clustered with K-means, and clustering performance was quantified with Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI).
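For readers unfamiliar with the two clustering metrics, the following sketch computes ARI and NMI from two label lists using only the standard library. It is illustrative and not the study's code; the NMI here is normalized by the arithmetic mean of the two entropies, which is one of several conventions and is an assumption.

```python
from collections import Counter
from math import comb, log
from typing import Hashable, Sequence


def adjusted_rand_index(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    n = len(truth)
    joint = Counter(zip(truth, pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)
    maximum = (sum_truth + sum_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_joint - expected) / (maximum - expected)


def normalized_mutual_info(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    n = len(truth)
    joint = Counter(zip(truth, pred))
    a, b = Counter(truth), Counter(pred)
    mi = sum(c / n * log(c * n / (a[i] * b[j])) for (i, j), c in joint.items())
    h_a = -sum(c / n * log(c / n) for c in a.values())
    h_b = -sum(c / n * log(c / n) for c in b.values())
    mean_entropy = (h_a + h_b) / 2  # assumption: arithmetic-mean normalization
    return mi / mean_entropy if mean_entropy > 0 else 1.0


# Toy labels: known groups versus K-means cluster assignments.
known = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 2, 2]
print(round(adjusted_rand_index(known, clusters), 4))  # 0.2424
```

Both metrics are invariant to relabeling of clusters, which is why they suit K-means output, where cluster numbers are arbitrary.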
Identification of Differential ICSs
Differential ICSs between groups, such as CRC patients versus healthy individuals, were identified using H3K4me3, H3K9ac, and H3K27ac data. The analysis covered genomic regions at 5,000-bp resolution, spanning from 20,000 bp upstream of the Transcription Start Site (TSS) to 20,000 bp downstream of the Transcription End Site (TES). Differential analysis was performed using DESeq2, with cut-offs of 1 for |log2(fold change)| and 0.05 for the adjusted P-value (Benjamini–Yekutieli).
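The Benjamini–Yekutieli adjustment and the two cut-offs can be sketched as follows. This is an illustration in plain Python, not the DESeq2 workflow itself; the toy P-values and fold changes are made up, and treating the cut-offs as "adjusted P below 0.05" and "|log2(fold change)| of at least 1" is an assumption about the direction and strictness of the inequalities.

```python
from typing import List


def benjamini_yekutieli(pvals: List[float]) -> List[float]:
    """Benjamini-Yekutieli adjusted P-values, returned in the input order."""
    m = len(pvals)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy results for three regions (hypothetical values).
pvals = [0.001, 0.02, 0.5]
log2fc = [2.0, 1.5, 3.0]

padj = benjamini_yekutieli(pvals)
# Assumed inequality directions for the two cut-offs.
selected = [i for i in range(len(pvals)) if padj[i] < 0.05 and abs(log2fc[i]) >= 1]
print(selected)  # [0]
```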
Selection of ICSs in Downstream Diagnostic and Prognostic Analyses
The 18-ICS ChromHMM model, initially trained with eight histone modifications, was used. For cf-EpiTracing analyses, H3K4me3, H3K9ac, and H3K27ac were prioritized because of their established association with active regulatory elements. Ten ICSs (ICS6, ICS7, ICS9, ICS12, ICS13, ICS14, ICS15, ICS16, ICS17, and ICS18) were selected for downstream diagnostic, prognostic, and machine-learning analyses because they showed clearly distinguishable patterns.
GLM Analyses
Generalized Linear Models (GLMs) were used to assess how well tissue signature-associated ICS signals distinguished patients with CRC, CHD, or B cell lymphoma. The dataset was partitioned into an 80% training set and a 20% independent testing set. A binomial logistic regression GLM was constructed for each disease using disease-specific ICS signatures identified from the training data. Model performance was assessed by the Area Under the Curve (AUC) on the independent testing set.
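The evaluation scheme (an 80/20 split followed by AUC on the held-out set) can be sketched without fitting a model. In the example below the scores stand in for predicted probabilities from a fitted logistic regression; the helper names, the seed, and the toy values are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split(items: Sequence, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_fraction)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > h) + 0.5 * (c == h) for c in cases for h in controls)
    return wins / (len(cases) * len(controls))


samples = [f"sample_{i}" for i in range(10)]
train, test = train_test_split(samples)
print(len(train), len(test))  # 8 2

# Hypothetical predicted probabilities and true labels on a test set.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

Selecting disease-specific signatures from the training portion only, as described above, keeps the test AUC free of information leaked from the held-out samples.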
Development of Unbiased Screening Model
An unbiased screening model was developed based on the assumption that diseased tissues would exhibit increased tissue-specific signals. Signals from the 1,170 tissue-signature ICSs in healthy individuals were fitted to a normal distribution, with the 90th percentile defining a 'disease signal' threshold. Tissue enrichment scores were calculated by summing these disease signals, aggregated into broader categories such as colorectum, heart, or lymphocyte. Random forest feature selection was applied to identify the top 5 tissue-signature ICSs for each tissue, specifically chosen to enhance differentiation between patient and control groups. The tissue with the highest enrichment score was then identified as the primary diseased tissue.
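The thresholding and scoring steps can be sketched with the standard library's `NormalDist`. Two points are assumptions here: that a "disease signal" is the patient value of any feature exceeding its healthy 90th-percentile threshold, and that the enrichment score is the sum of those values per tissue. The random forest feature selection step is not shown, and all feature names and numbers are hypothetical.

```python
from statistics import NormalDist
from typing import Dict, List


def disease_threshold(healthy_values: List[float], quantile: float = 0.9) -> float:
    """Fit a normal distribution to healthy values and return the chosen percentile."""
    return NormalDist.from_samples(healthy_values).inv_cdf(quantile)


def enrichment_scores(patient: Dict[str, float],
                      healthy: Dict[str, List[float]],
                      tissue_of: Dict[str, str]) -> Dict[str, float]:
    """Sum the patient's above-threshold signals within each tissue category."""
    scores: Dict[str, float] = {}
    for feature, value in patient.items():
        if value > disease_threshold(healthy[feature]):
            tissue = tissue_of[feature]
            scores[tissue] = scores.get(tissue, 0.0) + value
    return scores


# Hypothetical signals for two tissue-signature ICSs.
healthy = {"colon_ICS6": [1.0, 1.2, 0.8, 1.1, 0.9],
           "heart_ICS7": [2.0, 2.2, 1.8, 2.1, 1.9]}
tissue_of = {"colon_ICS6": "colorectum", "heart_ICS7": "heart"}
patient = {"colon_ICS6": 3.0, "heart_ICS7": 2.0}

scores = enrichment_scores(patient, healthy, tissue_of)
print(max(scores, key=scores.get))  # colorectum
```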
Detection of Tumour-Derived Signals in Plasma
Tumor-derived ICSs in CRC patients were identified by comparing ChIP-seq data from tumor and normal tissues. Cancer-specific ICSs in plasma were identified by comparing cf-EpiTracing data from patients and healthy individuals. Tumor-derived signals captured in patient plasma were defined as cancer-specific ICSs detected in both tumor tissue and patient plasma samples, with statistical significance of the overlap evaluated by a hypergeometric test (P < 0.05). Non-cancer and shuffled controls were included to verify the specificity of these detections.
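The overlap test can be illustrated with a one-sided hypergeometric tail probability: given a universe of candidate ICSs, how likely is an overlap at least as large as the one observed if the two sets were drawn independently? The counts below are made up, and the choice of universe is an assumption that the text does not specify.

```python
from math import comb


def hypergeom_upper_tail(overlap: int, universe: int, set_a: int, set_b: int) -> float:
    """P(X >= overlap) when set_b items are drawn from a universe containing set_a hits."""
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    favorable = sum(comb(set_a, i) * comb(universe - set_a, set_b - i)
                    for i in range(overlap, upper + 1))
    return favorable / total


# Hypothetical counts: 1,000 candidate ICSs, 100 tumor-derived in tissue,
# 50 cancer-specific in plasma, 20 found in both.
p_value = hypergeom_upper_tail(overlap=20, universe=1000, set_a=100, set_b=50)
print(p_value < 0.05)  # True
```

With these toy numbers the expected overlap by chance is 5, so an overlap of 20 is far in the upper tail.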
XGBoost Machine Learning
XGBoost models were developed for two applications: (1) classifying CRC patients versus healthy individuals and detecting early colorectal precancerous lesions (CRA), and (2) diagnosing and grading DLBCL patients. Models were trained and tuned using Bayesian optimization with 10-fold cross-validation, maximizing AUC for classification tasks and minimizing RMSE for regression tasks. Validation groups consisted of independent cohorts from separate medical centers. The CRC-versus-healthy classifier was then applied to unseen CRA samples. For DLBCL, optimal cut-offs for staging were determined using the Youden index.
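The Youden index picks the cut-off that maximizes sensitivity + specificity - 1. The sketch below scans every observed score as a candidate cut-off; the scores and labels are toy values, and treating a score at or above the cut-off as positive is an assumption.

```python
from typing import List, Tuple


def youden_cutoff(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (best cut-off, Youden J), calling score >= cut-off positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = scores[0], -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cut)
        j = tp / n_pos - fp / n_neg  # sensitivity - (1 - specificity)
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Hypothetical model outputs and binary stage labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1]
print(youden_cutoff(scores, labels))  # (0.4, 0.75)
```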
Stratification of Prognosis of Patients with CRC
Differential analysis identified 747 significantly upregulated ICSs in CRC patients within the discovery dataset. Hierarchical clustering of these ICSs divided CRC patients into two subgroups (CRC-1 and CRC-2). Kaplan–Meier survival analysis and log-rank tests were then used to compare recurrence-free survival between these subgroups in both the discovery dataset and an independent validation dataset.
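As a reminder of what a Kaplan–Meier curve computes, here is a minimal estimator for one group. It is a sketch with toy follow-up data, not the survival analysis used in the study, and the log-rank comparison between subgroups is not shown.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Survival estimate after each event time; events: 1 = recurrence, 0 = censored."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        recurrences = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        leaving = sum(1 for tt in times if tt == t)
        if recurrences:
            survival *= 1 - recurrences / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients both leave the risk set
    return curve


# Hypothetical follow-up times (months) and event indicators for one subgroup.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print([(t, round(s, 2)) for t, s in curve])  # [(2, 0.8), (3, 0.6), (5, 0.3)]
```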
Disease Subtyping Analyses in Patients with B cell Lymphoma
Tissue signatures derived from CD34-positive, naive B, and GCB cells were quantified. Multi-class ROC analysis was used to evaluate the accuracy of B cell lymphoma subtype classification. Differential analyses identified ICSs exclusively enriched in each lymphoma subtype, and hierarchical clustering was used to visualize these differences. Similarly, GCB and non-GCB DLBCL subtypes were clustered based on their tissue-signature ICSs and benchmarked against the Hans classification.
Identification of Recurrence-Related ICSs for DLBCL
Differential analysis revealed 432 upregulated ICSs in DLBCL patients. A multivariate Cox proportional hazards model, adjusted for clinical indices (stage, age, LDH, β2-MG, WBC), identified 8 ICSs that were significantly associated with recurrence (adjusted P < 0.05). An integrated ICS score, weighted by log HR values, was then generated for prognosis prediction. Optimal thresholds for risk stratification were determined using maximally selected rank statistics. Kaplan–Meier survival curves and log-rank tests were used to assess the prognostic value of this score, which was also compared with the International Prognostic Index (IPI).
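A score "weighted by log HR values" is commonly a weighted sum of the form score = Σ log(HR_i) × signal_i. The sketch below assumes that form; the ICS names, hazard ratios, and signals are hypothetical and only two features are shown rather than eight.

```python
from math import log
from typing import Dict


def ics_score(signals: Dict[str, float], hazard_ratios: Dict[str, float]) -> float:
    """Weighted sum of ICS signals, each weighted by the log of its hazard ratio."""
    return sum(log(hr) * signals[name] for name, hr in hazard_ratios.items())


# Hypothetical hazard ratios from a Cox model and signals for one patient.
hazard_ratios = {"ics_a": 2.0, "ics_b": 0.5}
patient_signals = {"ics_a": 3.0, "ics_b": 1.0}

score = ics_score(patient_signals, hazard_ratios)
print(round(score, 4))  # 1.3863
```

A hazard ratio above 1 gives a positive weight, so higher signal at a risk-associated ICS raises the score, while a protective ICS (HR below 1) lowers it.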
Pseudo-time Analyses
Monocle3 (v.1.2.9) was used to delineate sample trajectories and calculate pseudotime for DLBCL, FL, and tFL samples. The analysis was based on similarities in tissue-of-origin signals (1,170 tissue-signature ICSs) and was used to resolve FL-DLBCL transformation trajectories.
Transcription Factor Motif Analysis
HOMER (v.4.11) was used for transcription factor motif discovery on selected 200-bp genomic bins. The analysis focused on active ICSs in tFL samples (those repressive in 80% of FL samples) and active ICSs in 80% of DLBCL samples (those repressive in 80% of FL samples). P-values were adjusted using the Benjamini–Hochberg procedure to control for multiple comparisons.
Detection of t(11;14) Translocation Events
Translocation events were detected by aligning 20-bp segments of unaligned sequencing reads to chromosomes 11 and 14. Fragments were retained only if one segment aligned to chromosome 11 and the other to chromosome 14, with cross-validation using paired reads. False positives were reduced by excluding alignments outside known candidate loci. Translocation scores were defined as the sum of detected events across H3K4me3, H3K9ac, and H3K27ac cf-EpiTracing data.
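The split-read logic can be sketched as follows. Several things are assumptions: the two 20-bp segments are taken from the two ends of each unaligned read, the aligner is replaced by a toy lookup table, the candidate locus coordinates are placeholders rather than real genomic positions, and the paired-read cross-validation is omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

SEGMENT = 20
Hit = Optional[Tuple[str, int]]  # (chromosome, position) or None if unaligned

# Placeholder windows, not real coordinates.
CANDIDATE_LOCI: Dict[str, Tuple[int, int]] = {"chr11": (1000, 2000), "chr14": (5000, 6000)}


def in_candidate_locus(hit: Hit) -> bool:
    if hit is None or hit[0] not in CANDIDATE_LOCI:
        return False
    start, end = CANDIDATE_LOCI[hit[0]]
    return start <= hit[1] <= end


def is_translocation_read(read: str, align: Callable[[str], Hit]) -> bool:
    """True if one end maps to chr11 and the other to chr14, both inside candidate loci."""
    if len(read) < 2 * SEGMENT:
        return False
    head, tail = align(read[:SEGMENT]), align(read[-SEGMENT:])
    if not (in_candidate_locus(head) and in_candidate_locus(tail)):
        return False
    return {head[0], tail[0]} == {"chr11", "chr14"}


# Toy "aligner": a lookup table standing in for a real short-read aligner.
toy_index = {"A" * 20: ("chr11", 1500), "C" * 20: ("chr14", 5500), "G" * 20: ("chr14", 9000)}
reads_by_mark: Dict[str, List[str]] = {
    "H3K4me3": ["A" * 20 + "C" * 20],
    "H3K9ac": ["A" * 20 + "G" * 20],   # tail falls outside the candidate locus
    "H3K27ac": ["C" * 20 + "TT" + "A" * 20],
}

# Translocation score: detected events summed across the three marks.
score = sum(is_translocation_read(r, toy_index.get)
            for reads in reads_by_mark.values() for r in reads)
print(score)  # 2
```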
Identification of Ageing-Related ICSs
Ageing-related differential ICSs were defined as those that were significantly correlated with age in healthy individuals (adjusted P < 0.05, Pearson correlation > 0.5) and that also showed a consistent ageing-related pattern when MCL patients were compared with healthy controls.
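The correlation filter can be illustrated with a plain Pearson coefficient. The sketch applies only the r > 0.5 cut-off; the adjusted P-value step is omitted. Interpreting "consistent pattern" as a positive MCL-versus-healthy log2 fold change is an assumption, and all values are toy data.

```python
from math import sqrt
from typing import Dict, List


def pearson_r(x: List[float], y: List[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical ages of healthy individuals and signals for two ICSs.
ages = [30, 40, 50, 60, 70]
signals: Dict[str, List[float]] = {"ics_x": [1, 2, 3, 4, 5], "ics_y": [5, 1, 4, 2, 3]}
mcl_log2fc = {"ics_x": 1.4, "ics_y": 0.2}  # MCL versus healthy, hypothetical

ageing_related = [name for name, values in signals.items()
                  if pearson_r(ages, values) > 0.5 and mcl_log2fc[name] > 0]
print(ageing_related)  # ['ics_x']
```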
Gene Ontology (GO) Term Enrichment Analysis
Genes associated with identified differential ICSs were subjected to Gene Ontology (GO) term enrichment analysis using clusterProfiler. Hypergeometric tests and Benjamini–Hochberg adjustment were applied to identify overrepresented GO terms across Biological Process, Molecular Function, and Cellular Component categories.
Statistical Analysis
In general, statistical tests were two-sided, and P < 0.05 was considered statistically significant. Multiple testing was corrected with the Bonferroni, Benjamini–Yekutieli, or Benjamini–Hochberg procedure as appropriate. Statistical methods included log-rank tests, Wald log-likelihood tests, paired two-tailed Student’s t-tests, Kruskal–Wallis tests, hypergeometric tests, and Kolmogorov–Smirnov tests. All analyses and visualizations were performed using custom R scripts and R packages (e.g., ggplot2, pheatmap, survminer, survival).
Reporting Summary
Further information on the research design is available in the Nature Portfolio Reporting Summary linked to this article.