Shotgun sequencing and generation of metagenome-assembled genomes

In our previous study, 78 Holstein Friesian dairy cows were sampled for rumen content, metagenomic shotgun sequencing was carried out, and the raw Illumina sequencing reads were assembled into contigs with the MEGAHIT assembler using default settings [7]. Here, we used a pooled assembly of the original 78 samples to increase the quality of the metagenome-assembled genomes (MAGs), with the syntax: megahit [14] -t 60 -m 0.5 -1 [Illumina R1 files] -2 [Illumina R2 files]. Next, the assembled contigs were indexed using BBMap [15]: bbmap.sh threads=60 ref=[contigs filename]. Thereafter, reads from each sample were mapped to the assembled contigs using the BBTools bbwrap.sh script. To determine the depth (coverage) of each contig within each sample, the jgi_summarize_bam_contig_depths tool was applied with the parameters: jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt *.bam.

Using the depth information, MetaBAT 2 [16] was executed to bin the contigs together into reconstructed genomes, with parameters: metabat2 -t40 -a depth.txt.

To evaluate genomic bin quality, we used the CheckM [17] tool, with parameters: checkm lineage_wf [in directory] [out directory] -x faa --genes -t10.

We generated 93 unique high-quality MAGs and further expanded our MAG database by including phyla that were not represented in our set of MAGs. To do so, we used the published compendium of 4,941 rumen metagenome-assembled genomes [18] and dereplicated those MAGs using dRep [19]. We then selected the MAGs from the phyla Spirochaetes, Actinomycetota, Proteobacteria, Firmicutes, Elusimicrobia, Bacillota, Fibrobacteres and Fusobacteria that had the highest mean coverage in our samples, as calculated using BBMap and jgi_summarize_bam_contig_depths as described above [15]. This strategy minimized the false discovery rate (FDR) that would have been obtained had larger, unspecific databases been employed [20], and allowed the addition of 14 MAGs to our database.

To create the proteomic search library, genes were identified across the 107 MAGs using the Prodigal tool [21] (parameters: prodigal -p meta) and translated in silico into proteins using the same tool. Replicate sequences were removed. Protein sequences from the host animal (Bos taurus) and common contaminant protein sequences (64,701 in total) were added to the proteomic search library to avoid erroneous target protein identifications originating from the host or common contaminants. Finally, to allow subsequent assessment of the percentage of false-positive identifications within the proteomic search [22], the proteomic search library sequences were reversed in order and served as a decoy database.

The bacterial fraction from the rumen fluid of the 12 animals selected from the extreme feed efficiency phenotypes was obtained at the same time as the samples analyzed for metagenomics and stored at −20 °C until extraction. To extract total proteins, a modified protocol from Deusch and Seifert was used [23]. Briefly, cell pellets were resuspended in 100 µl of 50 mM Tris-HCl (pH 7.5; 0.1 mg/ml chloramphenicol; 1 mM phenylmethylsulfonyl fluoride (PMSF)) and incubated for 10 min at 60 °C and 1,200 rpm in a thermo-mixer after addition of 150 µl of 20 mM Tris-HCl (pH 7.5; 2% sodium dodecyl sulfate (SDS)). After the addition of 500 µl DNase buffer (20 mM Tris-HCl pH 7.5; 0.1 mg/ml MgCl2; 1 mM PMSF; 1 µg/ml DNase I), the cells were lysed by ultra-sonication (amplitude 51–60%; cycle 0.5; 4 × 2 min) on ice, incubated in the thermo-mixer (10 min at 37 °C and 1,200 rpm) and centrifuged at 10,000 × g for 10 min at 4 °C. The supernatant was collected and centrifuged again. The proteins in the supernatant were precipitated by adding pre-cooled trichloroacetic acid (TCA; 20% v/v). After centrifugation (12,000 × g; 30 min; 4 °C), the protein pellets were washed twice in pre-cooled (−20 °C) acetone (2 × 10 min; 12,000 × g; 4 °C) and dried by vacuum centrifugation. The protein pellet was resuspended in 2× SDS sample buffer (4% SDS (w/v); 20% glycerin (w/v); 100 mM Tris-HCl pH 6.8; a pinch of bromophenol blue; 3.6% 2-mercaptoethanol (v/v)) by 5 min in a sonication bath and vortexing. Samples were incubated for 5 min at 95 °C and separated by 1D SDS-PAGE (Criterion TG 4–20% Precast Midi Gel, BIO-RAD Laboratories, Inc., USA).

As previously described, after fixation and staining, each gel lane was cut into 10 pieces, destained, desiccated, and rehydrated in trypsin [24]. The in-gel digest was performed by incubation overnight at 37 °C. Peptides were eluted with Aq. dest. by sonication for 15 min. The sample volume was reduced in a vacuum centrifuge.

Before MS analysis, the tryptic peptide mixture was loaded onto an Easy-nLC II or Easy-nLC 1000 (Thermo Fisher Scientific, USA) system equipped with an in-house built 20 cm column (inner diameter 100 µm; outer diameter 360 µm) filled with ReproSil-Pur 120 C18-AQ reversed-phase material (3 µm particles, Dr. Maisch GmbH, Germany). Peptides were eluted with a nonlinear 156 min gradient from 1 to 99% solvent B (95% acetonitrile (v/v); 0.1% acetic acid (v/v)) in solvent A (0.1% acetic acid (v/v)) at a flow rate of 300 nl/min and injected online into an LTQ Orbitrap Velos or Orbitrap Velos Pro (Thermo Fisher Scientific, USA). An overview scan at a resolution of 30,000 in the Orbitrap, over a range of 300–2,000 m/z, was followed by MS/MS fragment scans of the 20 most abundant precursor ions. Ions without a detected charge state, as well as singly charged ions, were excluded from MS/MS analysis. The original raw spectra files were converted into the common mzXML format for downstream processing. The spectra file from each proteomic run of a given sample was searched against the protein search library using the Comet [25] search engine with default settings.

The Trans-Proteomic Pipeline (TPP) [26] was used to further process the Comet [25, 27] search results and produce a protein abundance table for each sample. In detail, PeptideProphet [28] was applied to validate peptide assignments, with filtering criteria set to a probability of 0.001, accurate mass binning, the non-parametric error model (decoy model) and decoy hits reporting. In addition, iProphet [28, 29] was applied to refine the peptide identifications coming from PeptideProphet. Finally, ProteinProphet [28,29,30] was applied to statistically validate peptide identifications at the protein level. This was carried out using the command: xinteract -N[my_sample_nick].pep.xml -THREADS=40 -p0.001 -l6 -PPM -OAPd -dREVERSE_ -ip [file1].pep.xml [file2].pep.xml .. [fileN].pep.xml > xinteract.out 2> xinteract.err. Then, the TPP GUI was used to produce a protein table from the resulting ProtXML files (extension ipro.prot.xml).

Subsequently, proteins with an identification probability < 0.9 were removed, as were proteins supported by fewer than two unique peptides (see Supplementary Table 1).

A reference database containing the contigs of all 107 MAGs was created (bbmap.sh command, default settings). Then, the paired-end short reads from each sample (FASTQ files) were mapped against the reference database (bbwrap.sh, default settings), producing alignment (SAM) files, which were converted into BAM format. Subsequently, a contig depth (coverage) table was produced using the command jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt *.bam. As each MAG spans more than one contig, MAG depth in each sample was calculated as the contig-length-weighted average depth. Finally, to account for unequal sequencing depth, each MAG depth was normalized to the number of short sequencing reads within the given sample.
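
A minimal sketch of this aggregation step in R, assuming the depth.txt produced by jgi_summarize_bam_contig_depths and a hypothetical contig-to-MAG mapping table (contig_to_mag.tsv with columns contigName and MAG); reads_per_sample is a hypothetical named vector of per-sample read counts.

library(dplyr)

depth <- read.delim("depth.txt")                        # contigName, contigLen, totalAvgDepth, <sample>.bam columns
bins  <- read.delim("contig_to_mag.tsv")                # hypothetical mapping of each contig to its MAG

mag_depth <- depth %>%
  inner_join(bins, by = "contigName") %>%
  group_by(MAG) %>%
  summarise(across(ends_with(".bam"),                   # per-sample depth columns
                   ~ weighted.mean(.x, w = contigLen))) # contig-length-weighted average depth per MAG

# Normalize each sample's MAG depths by that sample's read count to account for sequencing depth
norm_depth <- sweep(as.matrix(mag_depth[, -1]), 2, reads_per_sample, "/")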

To compare the metagenomic and proteomic structures, we first calculated the mean coding-gene abundance and the mean production level of each of the 1,629 detected core proteins over all 12 cows. Both mean gene abundance and mean production level were converted into ranks using the R rank function. The produced proteins were ranked in descending order and the coding genes in the gene abundance vector were reordered accordingly. The two reordered rank vectors were then plotted using the R pheatmap function and colored using the same color scale.
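
A minimal sketch of this rank comparison, assuming gene_abund and prot_abund are hypothetical named vectors holding the mean coding-gene abundance and the mean production level of the 1,629 core proteins (same order and names).

library(pheatmap)

prot_rank <- rank(prot_abund)                 # rank mean protein production
gene_rank <- rank(gene_abund)                 # rank mean coding-gene abundance

ord <- order(prot_rank, decreasing = TRUE)    # proteins in descending production rank
mat <- rbind(Protein = prot_rank[ord],        # genes reordered to match the protein ordering
             Gene    = gene_rank[ord])

pheatmap(mat, cluster_rows = FALSE, cluster_cols = FALSE,
         show_colnames = FALSE)               # both rows share the same color scale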

As our goal was to analyze plasticity in microbial protein production in varying environments, e.g., as a function of host state, only MAGs that were identified in all 12 proteomic samples were kept for further analysis. Consequently, only proteins that were identified in at least half of the proteomic samples (i.e., in at least six samples) were selected. This last step aimed to reduce spurious correlation results. These filtering steps retained 79 MAGs coding for a total of 1,629 measurable proteins.

To calculate the accuracy in predicting the host feed efficiency state based on the different data layers available (16S rRNA (Supplementary Table 2), metagenomics, metaproteomics), the principal component analysis (PCA) axes for all the samples were calculated based on the microbial protein production profiles. Then, twelve cycles of model building and prediction were performed. Each time, one sample was left out and the first two PCs of each of the remaining cows, along with their phenotype (efficiency state), were used to build a Support Vector Machine (SVM) prediction model [R caret package]. The model was then used to predict the left-out animal's phenotype (feed efficiency) by feeding the model that animal's first two PCs. This leave-one-out methodology was repeated over all the samples. Finally, the prediction accuracy was determined as the percentage of cases in which the correct label was assigned to the left-out sample. For the proteomics data, this procedure was applied to both the raw protein counts and the protein production normalized by MAG abundance, which enabled us to compare the prediction accuracy of the microbial protein production to that of the raw protein counts.
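
A minimal sketch of this leave-one-out scheme, assuming prot is a hypothetical samples x proteins matrix and state a factor of efficiency labels; caret's svmLinear method is used here as one possible SVM implementation.

library(caret)

pcs  <- prcomp(prot, scale. = TRUE)$x[, 1:2]            # first two principal components per sample
pred <- character(nrow(pcs))

for (i in seq_len(nrow(pcs))) {                         # leave each cow out in turn
  fit <- train(x = pcs[-i, , drop = FALSE], y = state[-i],
               method = "svmLinear",
               trControl = trainControl(method = "none"))
  pred[i] <- as.character(predict(fit, pcs[i, , drop = FALSE]))
}

accuracy <- mean(pred == as.character(state))           # fraction of correctly labeled left-out cows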

To split the proteomics dataset into microbial proteins that tend to be produced differently as a function of the host feed efficiency state, each microbial protein profile was correlated with the samples' host feed efficiency measure (as calculated by residual feed intake, RFI) using the Spearman correlation (R function cor), disregarding the p value. Proteins that had a positive correlation with RFI were grouped as inefficiency-associated proteins. In contrast, proteins that presented a negative correlation with RFI were grouped as efficiency-associated proteins. To test for equal sizes of these two protein groups, a binomial test was performed (R function binom.test) to examine the probability of obtaining such a low number of efficiency-associated proteins from the overall proteins under examination, when the expected probability was set to 0.5.
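
A minimal sketch of this split and the binomial test, assuming prot is a hypothetical samples x proteins matrix and rfi the per-sample residual feed intake values.

rho <- apply(prot, 2, function(p) cor(p, rfi, method = "spearman"))

inefficiency_assoc <- names(rho)[rho > 0]     # positively correlated with RFI
efficiency_assoc   <- names(rho)[rho < 0]     # negatively correlated with RFI

# Probability of obtaining this few efficiency-associated proteins when the expected split is 50/50
binom.test(length(efficiency_assoc),
           length(efficiency_assoc) + length(inefficiency_assoc),
           p = 0.5)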

Protein functions were assigned based on the KEGG (Kyoto Encyclopedia of Genes and Genomes) [31] database. The entire KEGG genes database was compiled into a DIAMOND [32] search library. Then, the selected microbial proteins were searched against this database using the DIAMOND search tool. Significant hits (e-value < 5e-5) were further analyzed to identify the corresponding KO (KEGG Orthology) number. Annotations of glycoside hydrolases were performed using dbCAN2 [33].

The checkerboard distribution of the protein production profiles was estimated separately within the feed-efficient and feed-inefficient animal groups. To enable comparison of the checkerboardness levels of the two groups, we chose a standardized C-score estimate (standardized effect size C-score, S.E.S. C-score), based on the comparison of the observed C-score to a null-model distribution derived from simulations. The S.E.S. C-score was estimated using the oecosimu function from the R vegan package with 100,000 simulated null-model communities.
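
A minimal sketch of the S.E.S. C-score estimation, assuming pa is a hypothetical presence/absence matrix of protein production for one animal group (proteins as rows, animals as columns); the C-score is implemented directly here and compared to a fixed-fixed null model via oecosimu, whose z value corresponds to the standardized effect size.

library(vegan)

c_score <- function(comm) {
  comm <- (as.matrix(comm) > 0) * 1           # presence/absence
  R    <- rowSums(comm)                       # number of animals each protein is produced in
  S    <- tcrossprod(comm)                    # shared animals for each protein pair
  Ri   <- matrix(R, nrow = length(R), ncol = length(R))
  CU   <- (Ri - S) * (t(Ri) - S)              # checkerboard units per protein pair
  mean(CU[upper.tri(CU)])                     # C-score: mean over all protein pairs
}

# nsimul is kept small in this sketch; the study used 100,000 simulated null-model communities
ses <- oecosimu(pa, c_score, method = "quasiswap", nsimul = 999)
ses                                           # the reported z value is the S.E.S. C-score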

The functional redundancy within a given group of proteins was measured as the mean number of times a given KO occurred within that group, while neglecting proteins that had not been assigned a KO-level functional annotation.

To test whether a given group of proteins exhibits more or less functional redundancy than would be expected, a null distribution for functional redundancy was created based on the number of proteins in the given group. A random group of proteins was drawn from the entire set, keeping the same sample size as in the tested group, and the process was repeated 100 times. The functional redundancy of each random protein group was then calculated. Thereafter, this null distribution was used to obtain a p value measuring the likelihood of obtaining such a value under the null.
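
A minimal sketch of the redundancy measure and its permutation null, assuming ko is a hypothetical vector of KO assignments for all proteins (NA where unannotated) and group a logical vector marking the tested protein group.

func_redundancy <- function(kos) {
  kos <- kos[!is.na(kos)]                     # ignore proteins without a KO-level annotation
  mean(table(kos))                            # mean number of occurrences per KO
}

obs    <- func_redundancy(ko[group])
null   <- replicate(100, func_redundancy(sample(ko, sum(group))))   # random groups of equal size
p_high <- mean(null >= obs)                   # likelihood of at least this much redundancy under the null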

Functional divergence between the two groups of proteins, i.e., the feed efficiency- and inefficiency-associated proteins, was examined by first counting the number of functional annotations (KOs) shared between the two groups. Thereafter, a null distribution for the expected count of shared KOs was built by iteratively splitting the proteins at random into groups of the same sizes and calculating the number of shared KOs. A p value for the actual count of shared KOs was obtained by ranking the actual count against the null distribution.
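
A minimal sketch of the shared-KO permutation test, assuming ko_eff and ko_ineff are hypothetical vectors of KO annotations for the efficiency- and inefficiency-associated proteins (NAs removed).

shared_obs <- length(intersect(unique(ko_eff), unique(ko_ineff)))

all_ko <- c(ko_eff, ko_ineff)
n_eff  <- length(ko_eff)

null <- replicate(1000, {                     # hypothetical number of random splits
  idx <- sample(length(all_ko), n_eff)        # random split preserving the group sizes
  length(intersect(unique(all_ko[idx]), unique(all_ko[-idx])))
})
pval <- mean(null <= shared_obs)              # rank of the observed count (fewer shared KOs = more divergence)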

Average nearest neighbor (ANN) ratio analysis was carried out independently for each protein function (KO) containing more than 14 proteins, with at least 5 proteins within each feed efficiency group. Initially, all proteins assigned to a given KO were split into two sets according to their feed efficiency affiliation group. Thereafter, the proteins within each set were independently projected into two-dimensional space by PCA applied directly to the Sequence Matrix [34]. The average nearest neighbor ratio within each set was then calculated within the minimum enclosing rectangle defined by principal component axes PC1 and PC2, as defined by Clark and Evans [35].
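
A minimal sketch of the ANN ratio for one protein set, assuming seq_mat is a hypothetical numeric sequence matrix for that set (proteins as rows); the expected nearest-neighbor distance follows the Clark and Evans formulation, with the PC1/PC2 minimum enclosing rectangle taken as the study area.

ann_ratio <- function(seq_mat) {
  xy <- prcomp(seq_mat)$x[, 1:2]              # project proteins onto PC1 and PC2
  d  <- as.matrix(dist(xy))
  diag(d) <- Inf
  obs_mean <- mean(apply(d, 1, min))          # observed mean nearest-neighbor distance
  area     <- diff(range(xy[, 1])) * diff(range(xy[, 2]))   # minimum enclosing rectangle
  exp_mean <- 0.5 / sqrt(nrow(xy) / area)     # expected distance under complete spatial randomness
  obs_mean / exp_mean                         # Clark & Evans nearest-neighbor ratio
}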

A microorganism feed efficiency score was calculated for each MAG individually by first ranking each protein produced by the given microbe along the 12 animals, based on the normalized protein production levels. Thereafter, a representative production value for the microbe in each animal was calculated as the average of the ranked (normalized) protein production levels in that animal (using the R rank function). This ranking allowed us to alleviate the potential skewing effect of highly expressed proteins. The microorganism's Feed Efficiency Score was calculated as the difference between its mean representative production value within feed-efficient animals and that within feed-inefficient animals. Values close to zero reflect a similar distribution between the two animal groups, positive values indicate higher expression among efficient animals, and negative values indicate higher expression among inefficient animals. To calculate significance, the actual Feed Efficiency Score was compared to values in a distribution derived from a permutation-based null model. Each of the permuted Feed Efficiency Scores (10,000 for each microbe) was obtained by independently shuffling each of the proteins produced by the MAG between the animals, prior to recalculating the microorganism feed efficiency score. By positioning the absolute score value within its distribution under the permuted assumptions (absolute values), we obtained a significance p value.
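
A minimal sketch of the score and its permutation test, assuming mag_prot is a hypothetical proteins x animals matrix of normalized production for one MAG and efficient a logical vector marking the feed-efficient animals.

fe_score <- function(mat, eff) {
  ranked <- t(apply(mat, 1, rank))            # rank each protein across the 12 animals
  repres <- colMeans(ranked)                  # representative production value per animal
  mean(repres[eff]) - mean(repres[!eff])      # efficient minus inefficient group means
}

obs  <- fe_score(mag_prot, efficient)
null <- replicate(10000,                      # shuffle each protein independently between animals
                  fe_score(t(apply(mag_prot, 1, sample)), efficient))
pval <- mean(abs(null) >= abs(obs))           # position of the absolute score within the permuted distribution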

To assess the link between the phylogenetic similarity of the MAGs and their association with feed efficiency, a phylogenetic tree estimating the evolutionary relationships between the MAGs was constructed using the PhyloPhlAn pipeline [36]. The phylogenetic signal of the Microorganism Feed Efficiency Score was estimated by providing the phyloSignal function from the R phylosignal package [37] with the MAGs' phylogenetic tree and the respective score values. Pagel's lambda statistic was chosen for the analysis, owing to its robustness [38].
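
A minimal sketch of the phylogenetic signal test, assuming tree is the PhyloPhlAn tree (an ape phylo object) and fe_scores a hypothetical named vector of Microorganism Feed Efficiency Scores matching the tree tip labels.

library(phylobase)
library(phylosignal)

p4d <- phylo4d(tree, tip.data = data.frame(FEscore = fe_scores[tree$tip.label]))
phyloSignal(p4d, methods = "Lambda")          # Pagel's lambda statistic and its p value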

All bar plots, scatter plots and other point plots were generated with the R package ggplot2. Heatmaps were produced with either the ggplot2 [39] or pheatmap [https://cran.r-project.org/web/packages/pheatmap/index.html] R packages. The KEGG map was produced using the online KEGG Mapper tool [40]. The phylocorrelogram was produced with the phyloCorrelogram function from the R package phylosignal [37].

MAGs that contained a minimal number of proteins (at least 50 functions) were selected for the differential protein production analysis, in order to have sufficient data to perform statistical tests. For each MAG, the relative production values were used to calculate the Jaccard pairwise dissimilarity of core protein production between feed-efficient and feed-inefficient cows using the R vegan package. Analysis of similarities (ANOSIM) between the efficiency and inefficiency groups was then performed for each MAG, and R values and p values were calculated using the same package.
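
A minimal sketch for a single MAG with the vegan functions named above, assuming prod is a hypothetical cows x core-proteins relative production matrix and state the cows' efficiency grouping.

library(vegan)

dis <- vegdist(prod, method = "jaccard")      # pairwise Jaccard dissimilarity between cows
fit <- anosim(dis, grouping = state, permutations = 999)
fit$statistic                                 # ANOSIM R value
fit$signif                                    # associated p value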

Using all GH-annotated proteins, a feature table summing the counts of each GH family within each sample was produced. Thereafter, leave-one-out cross-validation (LOOCV) [R caret package] was performed, each time building a Random Forest (RF) prediction model from the GH family counts and efficiency states of 11 samples, leaving one sample out. Each RF model, in its turn, was applied to the left-out animal to predict its efficiency state. Model accuracy and the area under the ROC curve (AUC) were calculated based on the LOOCV performance.
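
A minimal sketch of the LOOCV Random Forest evaluation, assuming gh_counts is a hypothetical samples x GH-family count table and state the efficiency labels; here caret's built-in LOOCV resampling stands in for the explicit leave-one-out loop.

library(caret)

ctrl <- trainControl(method = "LOOCV", classProbs = TRUE,
                     savePredictions = "final",
                     summaryFunction = twoClassSummary)
rf_fit <- train(x = gh_counts, y = state, method = "rf",
                metric = "ROC", trControl = ctrl)
rf_fit$results                                # LOOCV performance, including ROC (AUC), sensitivity and specificity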
