Nikos Kyrpides, Joint Genome Institute
Co-authors: Kostas Mavrommatis, Natalia Ivanova

Metagenomics has emerged as a powerful tool for exploring the functional capabilities of microbial communities, regardless of whether their members can be grown in pure culture. However, little is known about the efficacy of the methods used to process these datasets. Furthermore, sequencing of environmental genomic DNA usually produces large, highly fragmented datasets that pose a significant challenge for downstream functional analysis and interpretation. The causes are insufficient sequence coverage, which prevents assembly of individual reads; higher sequence error rates (frameshifts), which disrupt protein-coding sequences; and unsatisfactory performance of assembly and gene prediction tools. In addition, many metagenomic datasets derived from communities with one or a few dominant genera exhibit a high degree of redundancy, owing to very similar but not identical sequences that originate from different strains and species of the same genus. As a result, many of the analyses routinely performed on finished and draft isolate genomes, such as identification of secreted and multi-domain proteins or reconstruction of phylogenetic trees, become very labor-intensive, if not impossible, on metagenomic datasets. We will discuss the evaluation of methods used for the analysis of metagenomic datasets, as well as a post-processing method aimed at correcting many of the above errors and reducing the redundancy of these datasets.