RESCRIPt: Reproducible sequence taxonomy reference database management for the masses

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_2508015686

About this item

Full title

RESCRIPt: Reproducible sequence taxonomy reference database management for the masses

Author / Creator

Robeson, Michael S; O’Rourke, Devon R; Kaehler, Benjamin D; Ziemski, Michal; Dillon, Matthew R; Foster, Jeffrey T; Bokulich, Nicholas A

Publisher

Cold Spring Harbor: Cold Spring Harbor Laboratory Press

Journal title

bioRxiv, 2020-10

Language

English

Formats

Articles

Scope and Contents

Contents

Abstract

Background: Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and a lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a software package for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases.

Results: To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA, and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though they may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes.

Conclusions: RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

Competing Interest Statement: The authors have declared no competing interest.

Footnotes

* https://github.com/bokulich-lab/RESCRIPt
* https://doi.org/10.5281/zenodo.3891931