EnzChemRED, a rich enzyme chemistry relation extraction dataset
About this item
Author / Creator
Lai, Po-Ting; Coudert, Elisabeth; Aimo, Lucila; Axelsen, Kristian; Breuza, Lionel; de Castro, Edouard; Feuermann, Marc; Morgat, Anne; Pourcel, Lucille; Pedruzzi, Ivo; Poux, Sylvain; Redaschi, Nicole; Rivoire, Catherine; Sveshnikova, Anastasia; Wei, Chih-Hsuan; Leaman, Robert; Luo, Ling; Lu, Zhiyong; Bridge, Alan
Publisher
Ithaca: Cornell University Library, arXiv.org
Language
English
More information
Scope and Contents
Contents
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED (Enzyme Chemistry Relation Extraction Dataset), a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods, such as (large) language models, that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best-performing methods after fine-tuning on EnzChemRED to create an end-to-end pipeline for knowledge extraction from text, and apply it to abstracts at PubMed scale to create a draft map of enzyme functions in the literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/....
Alternative Titles
Full title
EnzChemRED, a rich enzyme chemistry relation extraction dataset
Identifiers
Primary Identifiers
Record Identifier
TN_cdi_proquest_journals_3044062169
Permalink
https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_proquest_journals_3044062169
Other Identifiers
E-ISSN
2331-8422