Evaluating techniques for metagenome annotation using simulated sequence data

Richard James Randle-Boggis, Thorunn Helgason, Melanie Sapp, Peter D Ashton

Research output: Contribution to journal › Article › peer-review


The advent of next-generation sequencing has allowed huge amounts of DNA
sequence data to be produced, advancing the capabilities of microbial ecosystem studies. The
current challenge is identifying from which microorganisms and genes the DNA originated.
Several tools and databases are available for annotating DNA sequences. The tools, databases
and parameters used can have a significant impact on the results: naïve choice of these factors
can result in a false representation of community composition and function. We use a
simulated metagenome to show how different parameters affect annotation accuracy by
evaluating the sequence annotation performances of MEGAN, MG-RAST, One Codex and
Megablast. This simulated metagenome allowed the recovery of known organism and
function abundances to be quantitatively evaluated, which is not possible for environmental
metagenomes. The performance of each program and database varied: for example, One Codex
correctly annotated many sequences at the genus level, whereas MG-RAST RefSeq produced
many false-positive annotations. This effect diminished at higher taxonomic levels.
Selecting more stringent parameters decreases the annotation sensitivity, but
increases precision. Ultimately, there is a trade-off between taxonomic resolution and
annotation accuracy. These results should be considered when annotating metagenomes and
interpreting results from previous studies.
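
The sensitivity and precision trade-off described above can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline. The function name, the read identifiers, the toy taxon assignments and the exact metric definitions (precision as correct annotations over all annotations made, sensitivity as correct annotations over all simulated reads) are assumptions chosen for illustration.

```python
def evaluate_annotations(truth, predicted):
    """Score predicted taxa against the known origin of simulated reads.

    truth:     dict mapping read id -> true taxon (known for simulated data)
    predicted: dict mapping read id -> assigned taxon; reads the tool left
               unannotated are simply absent from this dict
    Returns (precision, sensitivity) under the assumed definitions above.
    """
    # A true positive is a read assigned to the taxon it was simulated from.
    true_pos = sum(1 for read, taxon in predicted.items()
                   if truth.get(read) == taxon)
    precision = true_pos / len(predicted) if predicted else 0.0
    sensitivity = true_pos / len(truth) if truth else 0.0
    return precision, sensitivity


# Hypothetical toy data: four simulated reads with known genus of origin.
truth = {"r1": "Bacillus", "r2": "Bacillus",
         "r3": "Escherichia", "r4": "Pseudomonas"}

# Lenient parameters: every read is annotated, one of them wrongly.
lenient = {"r1": "Bacillus", "r2": "Bacillus",
           "r3": "Escherichia", "r4": "Escherichia"}

# Stringent parameters: fewer reads are annotated, all of them correctly.
strict = {"r1": "Bacillus", "r2": "Bacillus"}

print(evaluate_annotations(truth, lenient))
print(evaluate_annotations(truth, strict))
```

In this toy case the stringent run annotates fewer reads, so its sensitivity falls while its precision rises, which mirrors the trade-off the abstract reports. Such scoring is only possible because the true origin of every simulated read is known.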
Original language: English
Number of pages: 39
Early online date: 8 May 2016
Publication status: Published - 11 May 2016

Bibliographical note

© FEMS 2016.
