Improving contig binning of metagenomic data using $d_2^S$ oligonucleotide frequency dissimilarity
About this item
Full title
Author / Creator
Ying Wang, Kun Wang, Yang Young Lu and Fengzhu Sun
Publisher
BMC
Journal title
Language
English
Formats
Subjects
More information
Scope and Contents
Contents
Abstract
Background: Metagenomic sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, calling for further development.
Results: According to previous studies, relative sequence compositions are similar across different regions of the same genome but differ between distinct genomes. Current tools generally use the normalized frequency of k-tuples directly, which represents an absolute, not relative, sequence composition. Therefore, we attempted to model contigs using relative k-tuple composition, followed by measuring dissimilarity between contigs using $d_2^S$. The $d_2^S$ measure was designed to quantify the dissimilarity between two long sequences or Next-Generation Sequencing data sets using Markov models of the background genomes. This method was effective in revealing group and gradient relationships between genomes, metagenomes and metatranscriptomes. With many binning tools available, we do not try to bin contigs from scratch. Instead, we developed $d_2^S\mathrm{Bin}$ to adjust contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-tuples. To evaluate the performance of $d_2^S\mathrm{Bin}$, five widely used binning tools, based either on sequence composition alone or on a hybrid of sequence composition and abundance, were selected to bin six synthetic and real datasets, after which $d_2^S\mathrm{Bin}$ was applied to adjust the binning results. Our experiments showed that $d_2^S\mathrm{Bin}$ consistently achieves the best performance with tuple length k = 6 under the independent and identically distributed (i.i.d.) background model. Using the metrics of recall, precision and ARI (Adjusted Rand Index), $d_2^S\mathrm{Bin}$ improves the binning performance in 28 out of 30 testing experiments (6 datasets with 5 binning tools). $d_2^S\mathrm{Bin}$ is available at https://github.com/kunWangkun/d2SBin.
Conclusions: Experiments showed that $d_2^S$ accurately measures the dissimilarity between contigs of metagenomic reads and that relative sequence composition is a more reasonable basis for binning contigs. $d_2^S\mathrm{Bin}$ can be applied to any existing contig-binning tool for single metagenomic samples to obtain b...
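The abstract describes measuring dissimilarity between contigs from relative k-tuple composition under an i.i.d. background model. The Python sketch below illustrates one common formulation of the $d_2^S$ dissimilarity for that case. It is not taken from the d2SBin repository. It rests on several assumptions: the function names `centered_counts` and `d2s_dissimilarity` are hypothetical, only the forward strand is counted, sequences contain only A, C, G and T, and the background nucleotide frequencies are estimated from each sequence itself. Each k-tuple count is centered by subtracting its expected count under the background model, and the centered counts of the two sequences are then compared.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def centered_counts(seq, k):
    """Return {word: observed count - expected count} for all 4**k k-tuples.

    Assumption: the expected count uses an i.i.d. background whose nucleotide
    frequencies are estimated from `seq` itself (forward strand only).
    """
    seq = seq.upper()
    n_words = len(seq) - k + 1
    if n_words < 1:
        raise ValueError("sequence is shorter than k")
    base_counts = Counter(b for b in seq if b in ALPHABET)
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    base_freq = {b: base_counts[b] / total for b in ALPHABET}
    observed = Counter(seq[i:i + k] for i in range(n_words))
    centered = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        prob = 1.0
        for b in word:
            prob *= base_freq[b]  # i.i.d. model: product of base frequencies
        centered[word] = observed[word] - n_words * prob
    return centered


def d2s_dissimilarity(x, y, k=6):
    """Dissimilarity in [0, 1] between sequences x and y (0 = most similar)."""
    cx = centered_counts(x, k)
    cy = centered_counts(y, k)
    num = sx = sy = 0.0
    for word, xw in cx.items():
        yw = cy[word]
        norm = sqrt(xw * xw + yw * yw)
        if norm == 0.0:
            continue  # word carries no signal in either sequence
        num += xw * yw / norm
        sx += xw * xw / norm
        sy += yw * yw / norm
    if sx == 0.0 or sy == 0.0:
        raise ValueError("dissimilarity is undefined: all centered counts are zero")
    return 0.5 * (1.0 - num / (sqrt(sx) * sqrt(sy)))


if __name__ == "__main__":
    a = "ACGTTGCAACGGTTAACCGGATCGATTACG"
    b = "AAAATTTTAAATTTAATTATATTTAAATAT"
    print(d2s_dissimilarity(a, a, k=2))
    print(d2s_dissimilarity(a, b, k=2))
```

The default `k=6` mirrors the tuple length the abstract reports as performing best. The short example sequences use `k=2` only because 4096 possible 6-tuples cannot be sampled meaningfully from 30 bases.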
Alternative Titles
Full title
Improving contig binning of metagenomic data using $d_2^S$ oligonucleotide frequency dissimilarity
Authors, Artists and Contributors
Author / Creator
Identifiers
Primary Identifiers
Record Identifier
TN_cdi_doaj_primary_oai_doaj_org_article_c25fd15983df4e0c9e2f7dbe37a4f132
Permalink
https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_doaj_primary_oai_doaj_org_article_c25fd15983df4e0c9e2f7dbe37a4f132
Other Identifiers
E-ISSN
1471-2105
DOI
10.1186/s12859-017-1835-1