© Image 2008 Wikipedia
Metagenomics is about finding microbes that are present in an environment, determining what they do, and figuring out how they do it by using high-throughput DNA sequencing technology. It is a common technique that has revolutionized environmental microbiology by allowing a researcher to study the genes and genomes in DNA extracted from an environment.

Several steps to building a metagenomics data set

Metagenomics is about finding which microbes are present in the environment, what they are doing, and how they are doing it. © 2015 Center for Biological Sequence Analysis at the Technical University of Denmark

  1. First, a sample must be collected and DNA must be extracted from the sample.  
  2. Second, the DNA must be multiplied until there are enough copies to sequence; this is the sequencing library.  
  3. Third, the sequencing library is sequenced to yield the raw DNA data.  

In the case of the Illumina HiSeq platform, the raw data are 100- to 150-letter strings of A, T, G, and C characters, called sequencing "reads", with a quality score attached to each character. The fourth step is usually to assemble the reads into longer strings called contigs, with the goal of building one contig for each physically distinct piece of DNA (microbial chromosomes and plasmids) in the metagenome. This is a critical step, but because of global and local repeats in DNA sequence, each physical piece of DNA is usually still represented by many contigs that cannot be joined into a single contig. This assembly problem is a major topic of research at the intersection of computer science and microbiology. Once the reads have been assembled into contigs, genes can be predicted from the contigs, and the contigs can be clustered into draft genomes using common machine-learning algorithms. Analysis of gene content and draft genomes is usually done by bioinformaticians in collaboration with microbiologists and molecular biologists.
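To make the idea of a read with per-character quality scores concrete, here is a minimal Python sketch that parses reads stored in the widely used FASTQ text format. It assumes the Phred+33 quality encoding (older files may use a different offset), and the names Read, parse_fastq, and error_probability are made up for this illustration rather than taken from any particular tool.

```python
from typing import Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(text: str, offset: int = 33) -> Iterator[Read]:
    """Yield reads from FASTQ text (four lines per record).

    Assumption: qualities are Phred scores encoded as ASCII code minus `offset`.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"record at line {i + 1} does not start with '@'")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield Read(header[1:], sequence, [ord(c) - offset for c in quality])


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# A made-up seven-letter read; real HiSeq reads are 100 to 150 letters long.
example = "@read1\nGATTACA\n+\nIIIII5!\n"
reads = list(parse_fastq(example))
for read in reads:
    print(read.name, read.sequence, read.qualities)
```

A Phred score of 20 corresponds to a 1-in-100 chance that the base call is wrong, which is why low-scoring ends of reads are often trimmed before assembly.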

Metagenomics relies heavily on the Illumina HiSeq sequencing platforms, which deliver the most DNA data for the lowest price.

Many DNA sequencing machines © 2005 Wikipedia
Getting as much DNA data as possible is important for samples with many diverse microbes (like the human intestine), since researchers will only be able to assemble parts of the genomes in the sample if they do not collect enough reads from each genome. On the other hand, the shortness of the reads (100-150 nucleotides) makes it harder to assemble full contigs that represent whole chromosomes and plasmids than it would be with longer reads (such as PacBio reads with an average read length on the order of 8,000 nucleotides). This may change in the future as long-read sequencing platforms like the PacBio Sequel and the Oxford Nanopore MinION improve in price and quality.
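The link between sequencing depth and how much of a genome can be recovered can be sketched with some simple arithmetic. The Python snippet below uses the idealized Lander-Waterman model, which assumes reads land uniformly at random on the genome; real metagenomes deviate from this, and the function names and example numbers are illustrative assumptions only.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_size


def expected_fraction_covered(coverage: float) -> float:
    """Fraction of bases hit by at least one read under the idealized
    Lander-Waterman model (uniformly random reads): 1 - e^(-c)."""
    return 1.0 - math.exp(-coverage)


# Hypothetical example: a 5 Mb genome that receives 50,000 reads of 100 nt.
coverage = expected_coverage(50_000, 100, 5_000_000)
fraction = expected_fraction_covered(coverage)
print(f"coverage = {coverage:.1f}x, fraction covered = {fraction:.3f}")
```

In a diverse sample the reads are shared among all genomes present, so a rare microbe may receive only a small share of the total and end up with low coverage and a fragmented assembly.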

Metagenomic assembly is a key step that is under research and development.

De-Bruijn-graph-based assemblers such as IDBA, Ray Meta, ALLPATHS-LG, and MetaVelvet-SL, among others, are popular open-source tools for assembling short reads from metagenomes. Each assembler has unique advantages. Assembly quality, as measured by contig length, is ultimately limited by the read length. Repeat regions that cannot be spanned by reads will prevent contigs from being extended.
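The way repeats stop contigs from growing can be illustrated with a toy de Bruijn graph. The Python sketch below is a simplified illustration, not the implementation used by any of the assemblers named above: it ignores sequencing errors and reverse complements, and the example sequence and function names are invented. Each node is a (k-1)-mer, and a node with more than one successor marks a point where the assembler cannot tell which way to continue.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph: Dict[str, Set[str]]) -> List[str]:
    """Nodes with more than one successor, where a contig cannot be extended unambiguously."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Toy "genome" in which the repeat ACGT occurs twice in different contexts.
genome = "TTACGTGGCCACGTAA"
read_length = 8
toy_reads = [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]

# With k = 4 the repeat is longer than a node, so the graph branches inside it.
print(branching_nodes(build_de_bruijn(toy_reads, 4)))
# With k = 6 each k-mer spans the repeat plus unique flanking sequence.
print(branching_nodes(build_de_bruijn(toy_reads, 6)))
```

Because k cannot exceed the read length, a repeat longer than the reads always leaves a branch of this kind, which is the sense in which read length limits contig length.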

The effect of repeats on assembly © GCAT at Davidson College
Other methods to improve assembly include machine-learning algorithms that cluster contigs or reads based on the tetranucleotide frequency of each contig. A recently developed and very effective approach to clustering contigs or reads is to combine the compositional information (tetranucleotide or other k-mer frequencies) with the average coverage of each contig or read in two or more metagenomes from the same environment (such as a time series of samples from a single site). The coverage of the same contig or read in two different samples will change in lock-step with the change in abundance of its source chromosome, meaning that all the contigs from the same chromosome will co-vary in coverage. MetaBAT uses a combination of coverage covariance and tetranucleotide frequency to cluster contigs in an assembly from a De-Bruijn-graph-based assembler. The clusters can then be used to extract reads from the original DNA data, and those reads can be re-assembled more effectively with a De-Bruijn-graph-based assembler. Alternatively, Latent Strain Analysis uses k-mer counts across multiple metagenomes to cluster reads prior to any assembly. Clustered reads can then be assembled more effectively in a cluster-by-cluster fashion.
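The two signals described above, composition and co-varying coverage, can be computed in a few lines of Python. This is a hedged sketch of the general idea only: it is not how MetaBAT or Latent Strain Analysis are implemented, it uses a plain Pearson correlation as a stand-in for their statistical models, and the contig coverage values are invented for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List, Sequence

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(contig: str) -> List[float]:
    """Relative frequency of each tetranucleotide in a contig (one strand only)."""
    counts = Counter(contig[i:i + 4] for i in range(len(contig) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


# Hypothetical average coverage of three contigs in three samples of a time series.
coverage_a = [10.0, 40.0, 20.0]
coverage_b = [5.0, 20.0, 10.0]   # rises and falls with contig A
coverage_c = [30.0, 5.0, 25.0]   # follows a different pattern

freqs = tetranucleotide_frequencies("ACGTACGT")
print(freqs[TETRAMERS.index("ACGT")])
print(pearson(coverage_a, coverage_b), pearson(coverage_a, coverage_c))
```

Contigs A and B would be candidates for the same cluster because their coverage moves together across samples, while contig C would likely be placed elsewhere. A real binning tool would weigh this against the similarity of the tetranucleotide profiles.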

© Text 2015 Wikipedia

Importance of Metagenomics for several industries

We should add some more information here.

Medical importance - microbiome engineering

Industrial importance - quickly converging on solutions for industrial enzyme improvement

Research importance - find new forms of life that cannot be cultured

Companies working on this

About the authors

Nathan Brown
Jérôme Lutz from Berlin & Munich, Germany

I like to share the great things I discover daily while researching and working in the field of Synthetic Biology.

When I talk to people about it, they often refer to Science Fiction. However, when I send them links to this wiki and they read through those pages, they start understanding that this is real and it's happening right now.