Learning the molecular grammar of protein condensates from sequence determinants and embeddings

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8053968

About this item

Full title

Learning the molecular grammar of protein condensates from sequence determinants and embeddings

Publisher

United States: National Academy of Sciences

Journal title

Proceedings of the National Academy of Sciences - PNAS, 2021-04, Vol.118 (15), p.1-11

Language

English

More information

Scope and Contents

Contents

Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding on a global level the associations between protein sequence and phase behavior and further constructed machine-learning models for predicting protein liquid–liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone proteins are more disordered, less hydrophobic, and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome at a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The predictor, termed DeePhase, is accessible from https://deephase.ch.cam.ac.uk/....
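The abstract names three sequence-level quantities (Shannon entropy, hydrophobic content, and polar content) without defining how they were computed. As a rough illustration only, the sketch below computes composition-based versions of them for a single sequence. The residue groupings `HYDROPHOBIC` and `POLAR`, the function names, and the toy sequence are assumptions made for this example and are not taken from the paper or from DeePhase.

```python
import math
from collections import Counter

# Illustrative residue groupings (assumed for this sketch, not the paper's definitions).
HYDROPHOBIC = set("AILMFVW")
POLAR = set("NQSTY")


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the amino acid composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def residue_fraction(seq: str, group: set) -> float:
    """Fraction of residues in seq that belong to the given group."""
    return sum(1 for residue in seq if residue in group) / len(seq)


def sequence_features(seq: str) -> dict:
    """Collect the three composition-based features for one sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {
        "shannon_entropy": shannon_entropy(seq),
        "hydrophobic_fraction": residue_fraction(seq, HYDROPHOBIC),
        "polar_fraction": residue_fraction(seq, POLAR),
    }


if __name__ == "__main__":
    # Toy low-complexity sequence, invented for the example.
    print(sequence_features("GSYGQSSYGSYGQS"))
```

A low-complexity sequence that uses only a few residue types gives a lower entropy than one that uses many types evenly, which is the sense in which the abstract describes LLPS-prone proteins as having lower Shannon entropy.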

Identifiers

Primary Identifiers

Record Identifier

TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8053968

Permalink

https://devfeature-collection.sl.nsw.gov.au/record/TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8053968

Other Identifiers

ISSN

0027-8424

E-ISSN

1091-6490

DOI

10.1073/pnas.2019053118
