Using Artificial Intelligence to Explore the Biological World
By: Nigel Whittle
Head of Medical & Healthcare
3rd November 2021
Determining the shape of proteins is known as the ‘protein folding problem’ and has stood as a grand challenge in biology for the past 50 years. In 2020 the organisers of the biennial Critical Assessment of Protein Structure Prediction (CASP) competition recognised the AI system AlphaFold as a solution to this grand challenge. The area of Artificial Intelligence is rapidly changing, impacting many different medical sectors, and in this blog, I explore an application that may dramatically change the development of novel therapeutics.
The heart of the biological revolution
Since the discovery in 1953 of the double helical structure of DNA, and its role in encoding genetic information, genetics has exploded as a science, resulting in the development of the biotechnology industry and the use of genomics as a powerful health information source.
To carry out their function, DNA sequences must be converted into messages that can be used to produce proteins, which are the complex functional molecules within our bodies. The linear DNA sequence encodes, and is eventually translated into, a linear sequence of amino acids, which are the building blocks of proteins.
Proteins are involved in almost every important activity within a living organism: fighting off invading pathogens, digesting food, building structures such as muscle fibres or hair, providing oxygen to cells, and acting as messengers between cells. Proteins can undertake this vast array of different functions as a direct consequence of their structure. Although they are composed of a string of amino acids in a particular order, they do not remain one-dimensional but instead fold into complex three-dimensional shapes. Since there are 20 different types of amino acid, each with specific chemical properties, and proteins can range in size from tens to thousands of amino acids, a folded protein can display a vast range of chemical and functional characteristics.
Understanding how and what shape proteins fold into in order to provide their exquisite functional specificity is therefore essential to understanding how organisms function.
The protein-folding problem
Although proteins are almost completely defined by their linear sequence, it has been all but impossible to predict the shape into which a protein will fold. In 1972, Nobel prize-winner Christian Anfinsen predicted that one day it would be possible to determine a protein’s three-dimensional shape based solely on its linear sequence. But for nearly 50 years this problem remained a grand challenge for biologists.
The problem is that a protein can theoretically fold into about 10^300 different conformations, and it would take an impossibly long time for a protein molecule to sample every possible conformation. It is tempting to assume that proteins fold into the correct conformation as they are synthesised, block by block, but this does not appear to be the case, and unfolded (denatured) proteins can almost always be coaxed back into their correctly folded state. In many cases there are specific structures that can be identified within proteins, such as α-helices and β-sheets, that form secondary building blocks and contribute to the final structure.
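To give a feel for the scale of that number, the short Python sketch below estimates how long an exhaustive search would take. Only the figure of 10^300 conformations comes from the text; the sampling rate of one conformation per picosecond and the age of the universe used for comparison are illustrative assumptions. The arithmetic is done on logarithms because the quantities involved are close to the limit of ordinary floating-point numbers.

```python
import math

# Figure quoted in the text: roughly 10^300 possible conformations.
LOG10_CONFORMATIONS = 300

# Assumption (not from the article): the molecule tries one new
# conformation every picosecond, i.e. 10^12 conformations per second.
LOG10_SAMPLES_PER_SECOND = 12

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Assumption: age of the universe taken as roughly 1.38e10 years.
AGE_OF_UNIVERSE_YEARS = 1.38e10


def log10_years_to_sample_all(log10_conformations, log10_rate):
    """Return log10 of the years needed to visit every conformation once."""
    log10_seconds = log10_conformations - log10_rate
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


log10_years = log10_years_to_sample_all(
    LOG10_CONFORMATIONS, LOG10_SAMPLES_PER_SECOND
)
log10_universe_ages = log10_years - math.log10(AGE_OF_UNIVERSE_YEARS)

print(f"Exhaustive search: about 10^{log10_years:.1f} years")
print(f"That is about 10^{log10_universe_ages:.1f} times the age of the universe")
```

Whatever sampling rate is assumed, the answer remains vastly longer than the age of the universe, whereas real proteins fold far more quickly. Folding therefore cannot be a random search through all conformations.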
But the key question remains: out of all the possible configurations, how does each protein spontaneously fold into one particular shape, allowing it to carry out its specific biological role? Given the importance of the 3D structure of a protein, any attempt at the rational development of proteins as therapeutics is often hindered by this problem.
Every two years, the organisers of CASP hold a competition to determine the most effective Artificial Intelligence system for predicting protein structures. The competition is straightforward: competitors are given linear amino acid sequences for 100 proteins and are required to predict their structures, which are then measured against the known conformations. In 2020, a new AI system, AlphaFold, developed by London-based DeepMind, outclassed all opposition, successfully predicting the structure of the test proteins to within the width of about one atom. Previously, the structures of about 3,500 human proteins had been painstakingly determined using experimental techniques such as X-ray crystallography and NMR, whereas thanks to AlphaFold predicted 3D structures are now available for virtually all 20,000 such proteins.
The artificial intelligence of AlphaFold
Transformers are a neural network architecture that has been used extensively in ML systems since its introduction by Google Brain in 2017. AlphaFold’s development team created a new type of transformer designed specifically to work with three-dimensional structures.
In simple terms, a folded protein can be visualised as a ‘spatial graph’ in which amino acids are the nodes, and edges connect components in close proximity. AlphaFold attempts to interpret the structure of this graph, while reasoning with the virtual graph that it’s building. The model is structured to maximise information flow through recursive hypotheses that create increasingly accurate predictions of the underlying physical structure of the protein, and it can determine highly accurate structures in a matter of days.
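The idea of a spatial graph can be illustrated with a minimal Python sketch. This is not AlphaFold's code or method: it simply builds the kind of graph described above from a set of known 3D positions. The function name `build_spatial_graph`, the 8 ångström cutoff used to define 'close proximity', and the toy coordinates are all assumptions made for illustration.

```python
import math
from itertools import combinations

# Assumption: two amino acids count as "in close proximity" when they are
# within 8 angstroms of each other. The cutoff is illustrative only.
CONTACT_CUTOFF = 8.0


def build_spatial_graph(residues, cutoff=CONTACT_CUTOFF):
    """Build an undirected graph from (name, (x, y, z)) residue tuples.

    Nodes are residue indices; an edge joins two residues whose
    positions are no more than `cutoff` apart.
    """
    graph = {i: set() for i in range(len(residues))}
    for (i, (_, a)), (j, (_, b)) in combinations(enumerate(residues), 2):
        if math.dist(a, b) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Hypothetical five-residue chain bent into a hairpin (made-up coordinates).
toy_protein = [
    ("MET", (0.0, 0.0, 0.0)),
    ("ALA", (3.8, 0.0, 0.0)),
    ("GLY", (7.6, 0.0, 0.0)),
    ("LYS", (7.6, 3.8, 0.0)),
    ("SER", (3.8, 3.8, 0.0)),
]

graph = build_spatial_graph(toy_protein)
for index, neighbours in graph.items():
    print(toy_protein[index][0], "->", sorted(neighbours))
```

In this toy example the first and last residues (MET and SER) end up connected even though they are far apart in the linear sequence, which is exactly the kind of relationship a spatial graph captures and a one-dimensional sequence does not.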
There are a few potential drawbacks, of course. Because AlphaFold was trained on publicly available datasets of known protein structures, it may not accurately predict the shapes of unusual new proteins. And it does not reveal the mechanism or rules of protein folding, which would be needed for the protein folding problem to be considered solved from an academic perspective.
DeepMind plans to release structures for nearly every protein whose genetic sequence is known to science, over one hundred million. The contribution of AI to structural biology (and most importantly to the design of innovative medicines) has begun, and Plextek is looking to play its part through its capabilities in the field. We are refining and developing our expertise at Plextek in Machine Learning and Artificial Intelligence in order to provide our clients with state-of-the-art smart systems for medical purposes that use computational processes to improve performance and utility over time. For an initial chat, please get in touch.