Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
About this item
Publisher
England: BioMed Central Ltd
Journal title
Genome Biology
Language
English
More information
Scope and Contents
Contents
Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3):
https://github.com/martin-steinegger/conterminator...
Alternative Titles
Full title
Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
Identifiers
Primary Identifiers
Record Identifier
TN_cdi_doaj_primary_oai_doaj_org_article_0c58db75fd8a487eb02d1ef019ea16a4
Permalink
https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_doaj_primary_oai_doaj_org_article_0c58db75fd8a487eb02d1ef019ea16a4
Other Identifiers
ISSN
1474-760X, 1474-7596
E-ISSN
1474-760X
DOI
10.1186/s13059-020-02023-1