Protein Wars: ESMFold vs AlphaFold

While ESMFold's accuracy is similar to that of AlphaFold 2 and RoseTTAFold, its faster inference enables the exploration of the structural space of metagenomic proteins

Published on August 24, 2022

by Amit Raja Naik

Last month, Meta AI researchers released ESMFold, a protein structure prediction model built on the lab's Evolutionary Scale Modeling (ESM) family of protein language models. The new model is touted as one of the closest alternatives to DeepMind’s AlphaFold 2, which essentially solved the 50-year-old grand challenge of protein folding. Meta AI has released several ESM models over the years, and this latest work is publicly available.

The code is available in Meta AI's GitHub repository.

Besides ESMFold and AlphaFold, there are plenty of other protein structure prediction models, including RoseTTAFold, IntFOLD and RaptorX. Here’s a quick overview of the models:

ESMFold vs AlphaFold

Meta AI claimed that ESMFold's accuracy is similar to that of AlphaFold 2 and RoseTTAFold, while its inference is faster, which enables exploration of the structural space of metagenomic proteins. Metagenomics is the technique of sequencing DNA purified directly from a natural environment.

We have trained ESMFold to predict full atomic protein structure directly from language model representations of a single sequence. Accuracy is competitive with AlphaFold on most proteins with order of magnitude faster inference. By @MetaAI Protein Team.
https://t.co/APVoaawyOb pic.twitter.com/f6DvSfjuOX
— Alex Rives (@alexrives) July 21, 2022

While AlphaFold 2 uses a neural network built specifically for structure prediction, ESMFold leverages a large-scale protein language model. The Meta AI team said that improvements in language modelling perplexity and structure learning continue as models scale up to 15 billion parameters. For comparison, the team said its latest model, ESM-2, at 150 million parameters outperforms its older model, ESM-1b, at 650 million parameters.

In addition, AlphaFold 2 and other alternatives rely on multiple sequence alignments (MSAs) and templates of similar proteins to reach their best performance in atomic-resolution structure prediction. ESMFold, by contrast, generates its structure prediction from a single sequence as input by leveraging the internal representations of the language model.

When every model is given only a single sequence as input, ESMFold produces more accurate atomic-level predictions than AlphaFold 2. It is also competitive with RoseTTAFold even when RoseTTAFold is given full multiple sequence alignments (MSAs).

Amazing. We did see this also come up in ProGen – Large language models captured 3d structure through its attention. https://t.co/0oK7coEMV6
— Richard Socher (@RichardSocher) July 24, 2022

ESMFold produces predictions comparable to those of state-of-the-art models for low-perplexity sequences, and structure prediction accuracy correlates with language model perplexity more generally. In other words, the better the language model understands a sequence, the better it can predict the structure.
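For readers unfamiliar with the term, perplexity can be read as how "surprised" a language model is by a sequence. The snippet below is a minimal sketch of the standard definition (the exponential of the average negative log-probability per token). The probability values are made up for illustration, and the exact way Meta AI evaluates perplexity for ESM-2 may differ in its details.

```python
import math


def perplexity(token_probs):
    """Return exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities a model assigns to the correct residue
# at each position of a short sequence (illustrative values only).
well_understood = [0.9, 0.8, 0.95, 0.85]
# 0.05 = 1/20, i.e. a uniform guess over the 20 standard amino acids.
poorly_understood = [0.05, 0.05, 0.05, 0.05]

print(f"well understood:   {perplexity(well_understood):.2f}")
print(f"poorly understood: {perplexity(poorly_understood):.2f}")
```

A model that guesses uniformly among the 20 standard amino acids has a perplexity of about 20, while a confident, correct model approaches 1. Lower perplexity is the regime in which ESMFold is reported to be most accurate.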

One of the advantages of ESMFold is that it predicts structures faster than existing atomic-resolution structure predictors. This helps it bridge the gap between the rapid growth of protein sequence databases, which now contain billions of sequences, and the slower development of protein structure and function databases. The model has been used to rapidly compute one million predicted structures representing a diverse subset of metagenomic sequence space that lacks labelled structure or function.

Last month, DeepMind, in collaboration with European Bioinformatics Institute (EMBL-EBI), released predicted structures for nearly all catalogued proteins, which will expand the AlphaFold database by over 200x – from nearly 1 million structures to over 200 million structures – with the potential to increase our understanding of biology significantly.

DeepMind first introduced AlphaFold in 2018, presented its second version in 2020, and released an open-source version of its deep-learning neural network, AlphaFold 2, last year. It has since followed up with a multimer-focused model, AlphaFold-Multimer, which the team said significantly increases the accuracy of predicted multimeric interfaces over input-adapted single-chain AlphaFold while maintaining high intra-chain accuracy.

One of the biggest performance drivers for ESMFold has been the language model. When ESM-2 understands a protein sequence well, that is, when language modelling perplexity is low, ESMFold's predictions are comparable to those made by other models. As a result, ESMFold can deliver accurate atomic-resolution structure predictions up to two orders of magnitude faster than AlphaFold 2.

Meta AI said billions of protein sequences, many of them from metagenomic sequencing, have unknown structures and functions. ESMFold makes it possible to map this structural space on practical timescales: the researchers were able to fold a random sample of 1 million metagenomic sequences in a few hours. Moreover, they believe ESMFold can help in understanding regions of protein space that are distant from existing knowledge.
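To put that figure in perspective, here is a rough back-of-the-envelope sketch of the overall folding rate it implies. The one million sequences come from Meta AI's statement. The durations tried for "a few hours" are assumptions, since no exact time is given here, and the function name is purely illustrative.

```python
def aggregate_throughput(num_sequences, hours):
    """Sequences folded per second across all hardware combined."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return num_sequences / (hours * 3600)


NUM_SEQUENCES = 1_000_000  # figure quoted by Meta AI

# Assumption: "a few hours" is not quantified, so try a plausible range.
for assumed_hours in (2, 4, 6):
    rate = aggregate_throughput(NUM_SEQUENCES, assumed_hours)
    print(f"{assumed_hours} h -> about {rate:.0f} sequences per second overall")
```

Whatever the exact duration, the result is a combined rate of tens to over a hundred structures per second. That points to many accelerators running in parallel, although the amount of hardware used is not stated here.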

A new ‘super fast’ protein-predicting model emerges

ESMFold and AlphaFold are not alone. OmegaFold, developed by Chinese biotech firm Helixon, also predicts high-resolution protein structure from a single primary sequence. Recently, this model outperformed rival RoseTTAFold while achieving similar prediction accuracy to AlphaFold 2.

OmegaFold's code and model1 is released: https://t.co/QNS01ITjkM
— Jian Peng (@peng_illinois) August 3, 2022

Only recently, the company made its code publicly available, joining the likes of AlphaFold and ESMFold, which are also open source.

Why is this a big deal?

Understanding how proteins fold helps researchers and scientists uncover the underlying causes of many diseases. Advances in protein structure prediction and protein design can, in turn, help in finding cures and designing new medicines.
