Week 05 Lecture: Protein Structure Prediction in the Age of Artificial Intelligence

High-accuracy structural biology at the intersection of machine learning and physical modeling.

Protein sequence - structure - function paradigm

sequence-structure-function homology

Proteins with similar amino acid sequence have similar structure and similar function.

If we know the sequence, can we predict the structure?
If we know the structure, can we understand function?

Levels of protein structure

protein-structure-hierarchy

Secondary structure prediction

Neural networks were used early on for secondary structure prediction.

karplus-neuralnetwork-secondary

Holley, L.H. and Karplus, M., 1989. Protein secondary structure prediction with a neural network. PNAS, 86, 152-156
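
As a rough illustration of how such a network can be fed a sequence, the sketch below one-hot encodes a sliding window of residues around each position. A small feed-forward network could then classify the central residue as helix, strand, or coil from that vector. The window length, the extra padding symbol, and the name `encode_windows` are assumptions made for illustration and are not taken from the cited paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode_windows(sequence, window=17):
    """One-hot encode a sliding window centred on every residue.

    The window length (17) is an illustrative assumption. Positions that
    fall beyond the chain termini are marked with an extra padding slot.
    """
    half = window // 2
    n_symbols = len(AMINO_ACIDS) + 1  # 20 residues + 1 padding slot
    encoded = []
    for center in range(len(sequence)):
        vector = []
        for pos in range(center - half, center + half + 1):
            one_hot = [0] * n_symbols
            if 0 <= pos < len(sequence):
                # raises ValueError for non-standard residue letters
                one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
            else:
                one_hot[-1] = 1  # padding beyond the termini
            vector.extend(one_hot)
        encoded.append(vector)
    return encoded


# Toy example: a hypothetical six-residue peptide with a short window.
features = encode_windows("ACDEFG", window=3)
print(len(features), len(features[0]))
```

Each residue yields one input vector of length `window * 21`; the network output for that vector would be the predicted state of the central residue.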

cohen-langridge-neuralnetwork-secondary

Kneller, D.G., Cohen, F.E. and Langridge, R., 1990. Improvements in protein secondary structure prediction by an enhanced neural network. J. Mol. Biol., 214, 171-182

Weight matrices in early networks were simple.

cohen-langridge-neuralnetwork-secondary-weightmatrix

Kneller, D.G., Cohen, F.E. and Langridge, R., 1990. Improvements in protein secondary structure prediction by an enhanced neural network. J. Mol. Biol., 214, 171-182

Improved networks learn from sequence alignments.

rost-sander-neuralnetwork-secondary

Rost, B. and Sander, C., 1993. Improved prediction of protein secondary structure by use of sequence profiles and neural networks. PNAS, 90, 7558-7562

Further improved networks learn from sequence alignments encoded as position-specific scoring matrices, which are obtained from more sensitive iterated (PSI-BLAST) searches.

psipred-network

Jones, D.T., 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202
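
To make the idea of a position-specific scoring matrix concrete, the following sketch derives log-odds scores per alignment column from residue counts. It is deliberately simplified: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and more elaborate pseudocounts. The function name `build_pssm` and the toy alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column.

    Simplifying assumptions: uniform background frequencies, a flat
    pseudocount, and no sequence weighting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            if seq[col] in counts:  # gaps and unknown letters are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return pssm


# Toy alignment of three hypothetical homologs.
profile = build_pssm(["ACD", "ACE", "AC-"])
print(round(profile[0]["A"], 2), round(profile[0]["C"], 2))
```

Conserved residues receive positive scores and unobserved residues negative ones, so the network sees which substitutions evolution tolerates at each position rather than a single sequence.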

Secondary structure prediction accuracies routinely reached 80% by 2002 (CASP5).

Aloy, P., Stark, A., Hadley, C. and Russell, R.B., 2003. Predictions without templates: new folds, secondary structure, and contacts in CASP5. Proteins, 53, 436-456
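
The accuracy quoted above is presumably the per-residue three-state accuracy (commonly called Q3), that is, the percentage of residues whose predicted state (helix H, strand E, or coil C) matches the observed one. A minimal sketch of that measure, with hypothetical example strings:

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues with matching three-state assignment."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings differ in length")
    if not observed:
        raise ValueError("empty assignment")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical eight-residue example with one mismatch.
print(q3_accuracy("HHHEECCC", "HHHEEECC"))  # 87.5
```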

PSIPRED server gives reliable answers in a few minutes.

http://bioinf.cs.ucl.ac.uk/psipred

psipred-output

Simulations can also predict secondary structure, but they require much more computational effort, and the results are very sensitive to the choice of force field.

dill-peptidesampling

Shell, M.S., Ritterson, R. and Dill, K.A., 2008. A test on peptide stability of AMBER force fields with implicit solvation. J. Phys. Chem. B, 112, 6878-6886

Tertiary structure prediction

Tertiary protein structure prediction accuracy has improved dramatically in recent years.

casp progress

CASP | GDT
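
CASP progress is usually reported as GDT_TS, the average percentage of C-alpha atoms that lie within 1, 2, 4, and 8 Å of their reference positions. The sketch below is a simplified version that assumes the model has already been superimposed on the reference; the full GDT procedure searches for the best superposition at each cutoff, which is omitted here. The coordinates are made up for illustration.

```python
import math


def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for pre-superimposed C-alpha coordinates (in Å).

    Assumption: a single, fixed superposition is used for all cutoffs.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    distances = [math.dist(m, r) for m, r in zip(model, reference)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances) for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical four-residue example; the model deviates along y only.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
print(gdt_ts(model, reference))  # 56.25
```

Unlike RMSD, this score is not dominated by a few badly placed residues: the last residue above is 10 Å off but only lowers the fraction counted at each cutoff.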

Homology modeling

Structures are predicted based on sequence-structure similarity.

sequence-structure-function homology

If we know the structure of one protein, we can generate a good model for another protein with a similar sequence (e.g., human thioredoxin modeled on the E. coli protein). This requires a good sequence alignment and model-building tools that mutate side chains and add missing parts.
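
Whether a template is suitable is often judged first by sequence identity over the aligned positions. The sketch below computes that value from a pairwise alignment in which gaps are written as `-`. The aligned strings are invented placeholders, not the actual thioredoxin sequences, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over positions where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)


# Invented target/template alignment with one gap in the target.
print(round(percent_identity("MKT-AYI", "MRTLAYV"), 1))  # 66.7
```

Positions aligned to a gap in the template are the "missing parts" that a model-building tool has to construct without template coordinates.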

SWISS-MODEL | Modeller

Improved protocols assemble structures from homologous and de novo fragments.

i-tasser

I-Tasser | Rosetta

Physics-based protein folding

Molecular dynamics simulations can fold proteins, but with great computational effort.

shaw-folding

Lindorff-Larsen, K., Piana, S., Dror, R.O. and Shaw, D.E., 2011. How fast-folding proteins fold. Science, 334, 517-520
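
A back-of-the-envelope calculation shows where the computational effort comes from: the integration time step is on the order of femtoseconds, whereas folding takes microseconds to milliseconds. The 2 fs time step and the 1 ms trajectory length below are illustrative assumptions, not values quoted from the cited paper.

```python
def md_steps(simulated_time_s, timestep_fs=2.0):
    """Number of integration steps needed to cover a given simulated time.

    The default 2 fs time step is an assumption typical of atomistic MD.
    """
    return round(simulated_time_s / (timestep_fs * 1e-15))


# One millisecond of simulated time at a 2 fs time step.
print(md_steps(1e-3))  # 500000000000
```

Every one of these steps requires a force evaluation over all atoms of the protein and solvent, which is why such simulations depend on specialized hardware or very long run times.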

Molecular dynamics simulations can also refine predictions.

refinement_via_md

Heo, L. and Feig, M., 2018. Experimental accuracy in protein structure refinement via molecular dynamics simulations. PNAS, 115, 13276-13281.

Coevolutionary couplings

Residue contacts are extracted from multiple sequence alignments.

coevolutionary coupling

Homology modeling without contact restraints

Modeling with contact restraints
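
The simplest way to detect coevolving positions is the mutual information between two alignment columns: residues in contact tend to mutate together, so their columns carry information about each other. The sketch below computes that quantity for a toy alignment. It is only a first approximation; practical methods go further to separate direct couplings from indirect correlations and to correct for phylogenetic bias. The alignment and the function name `mutual_information` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 1 covary, column 2 is conserved,
# column 3 varies independently of column 0.
msa = ["AKLE", "AKLD", "GRLE", "GRLD"]
print(mutual_information(msa, 0, 1))  # 1.0
print(mutual_information(msa, 0, 3))  # 0.0
```

Column pairs with high scores can be turned into distance restraints for modeling, which is the difference between the two cases listed above.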

AlphaFold2

AlphaFold2 introduced an attention-based transformer architecture with iterative refinement that generates structures directly (end-to-end prediction).

af2-overall

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.
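
For readers unfamiliar with attention, the sketch below implements generic scaled dot-product attention: each query is compared with all keys, the similarities are turned into weights with a softmax, and the output is the weighted average of the values. This is the textbook building block only, written in plain Python for clarity; it is not AlphaFold2's actual implementation, which adds further components on top of it. All names and numbers are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Generic scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append(
            [
                sum(w * v[d] for w, v in zip(weights, values))
                for d in range(len(values[0]))
            ]
        )
    return output


# Two identical keys receive equal weight, so the output is the mean value.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
print([round(x, 6) for x in out[0]])  # [1.0, 1.0]
```

In a structure prediction setting the vectors would represent residues or residue pairs, so that information from any position can influence any other position in a single step.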

AlphaFold2’s ‘Evoformer’ integrates co-evolutionary information from the multiple sequence alignment with structural information (initially from templates, then iteratively from the models that are generated). The architecture uses elements of graph neural networks to maintain equivariance.

af2-evoformer

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.

AlphaFold2’s structure module directly generates structures via an attention-based module trained on actual structures but also augmented with physical constraints. Final models are generated after minimization with Amber (not shown).

af2-structuremodule

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.

AlphaFold2 models are remarkably accurate at atomistic detail, including side chains.

af2 prediction

Errors are generally around 1 Å RMSD but may be larger in loop regions or in parts of a structure that are subject to contacts with other units, for example due to crystal packing.

af2-errors

Heo, L., Janson, G. and Feig, M., 2021. Physics‐based protein structure refinement in the era of artificial intelligence. Proteins, 89, 1870-1887.
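
To see how a model can be accurate overall yet worse in loops, the RMSD can be evaluated separately for subsets of residues. The sketch below assumes the model has already been superimposed on the reference structure (the optimal superposition step is omitted) and uses made-up coordinates and residue ranges.

```python
import math


def rmsd(model, reference, indices=None):
    """RMSD (in Å) over selected residues of pre-superimposed coordinates."""
    if len(model) != len(reference):
        raise ValueError("coordinate lists must have the same length")
    selected = list(range(len(model))) if indices is None else list(indices)
    if not selected:
        raise ValueError("no residues selected")
    squared = sum(math.dist(model[i], reference[i]) ** 2 for i in selected)
    return math.sqrt(squared / len(selected))


# Hypothetical example: residues 0-1 are 'core', residues 2-3 are 'loop'.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 0.5, 0.0), (7.6, 3.0, 0.0), (11.4, 3.0, 0.0)]
print(rmsd(model, reference, [0, 1]))  # 0.5
print(rmsd(model, reference, [2, 3]))  # 3.0
```

Because deviations enter squared, a few poorly modeled loop residues can raise the global RMSD well above the value for the well-predicted core.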

What happens when AlphaFold2 structures are used in simulations?

af2-md

Heo, L., Janson, G. and Feig, M., 2021. Physics‐based protein structure refinement in the era of artificial intelligence. Proteins, 89, 1870-1887.

AlphaFold2 is very good at generating one structure, but what if there are multiple functional states?

GPCR-predictions

Heo, L. and Feig, M., 2021. Multi-state modeling of G-protein coupled receptors at experimental accuracy. bioRxiv 2021.11.26.470086.

RoseTTAFold is an alternative but conceptually similar approach that provides structures at an accuracy approaching that of AlphaFold2.

RoseTTAFold architecture

Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G.R., Wang, J., Cong, Q., Kinch, L.N., Schaeffer, R.D. and Millán, C., 2021. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373, 871-876.

Quaternary structure prediction

Predictions of complexes have remained a challenge, but much progress is being made:

Complex predictions

Bryant, P., Pozzati, G. and Elofsson, A., 2021. Improved prediction of protein-protein interactions using AlphaFold2 and extended multiple-sequence alignments. BioRxiv 2021.09.15.460468.

Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A.W., Green, T., Žídek, A., Bates, R., Blackwell, S., Yim, J. and Ronneberger, O., 2021. Protein complex prediction with AlphaFold-Multimer. BioRxiv 2021.10.04.463034.

Are we done?

“At present, for the best cases, the C-alpha coordinate RMSD accuracy of AlphaFold-predicted structures roughly corresponds to the accuracy expected for structures determined at resolutions no better than ∼4 Å. Thus, although structural predictions by AlphaFold and RoseTTAfold may be accurate enough to assist with experimental structure determination, they alone cannot provide the kind of detailed understanding of molecular and chemical interactions that is required for studies of molecular mechanisms and for structure-based drug design.”

and

“… solving the protein-folding problem means making accurate predictions of structures from amino acid sequences starting from first principles based on the underlying physics and chemistry.”

Moore, P.B., Hendrickson, W.A., Henderson, R. and Brunger, A.T., 2022. The protein-folding problem: Not yet solved. Science, 375, 507.