Week 05 Lecture: Protein Structure Prediction in the Age of Artificial Intelligence

High-accuracy structural biology at the intersection of machine learning and physical modeling.

Protein sequence - structure - function paradigm

Proteins with similar amino acid sequences tend to have similar structures and similar functions.

If we know the sequence, can we predict the structure?
If we know the structure, can we understand function?

Levels of protein structure

Secondary structure prediction

Neural networks were used early on for secondary structure prediction.

Holley, L.H. and Karplus, M., 1989. Protein secondary structure prediction with a neural network. PNAS, 86, 152-156

Kneller, D.G., Cohen, F.E. and Langridge, R., 1990. Improvements in protein secondary structure prediction by an enhanced neural network. J. Mol. Biol., 214, 171-182

Weight matrices in early networks were simple.

[Figure: weight matrix of the secondary structure prediction network]

Kneller, D.G., Cohen, F.E. and Langridge, R., 1990. Improvements in protein secondary structure prediction by an enhanced neural network. J. Mol. Biol., 214, 171-182

Improved networks learn from sequence alignments.

Rost, B. and Sander, C., 1993. Improved prediction of protein secondary structure by use of sequence profiles and neural networks. PNAS, 90, 7558-7562
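
The idea of feeding alignment-derived profiles to a network can be sketched as follows. This is a minimal illustration, not the Rost and Sander implementation: the toy alignment, the window size of 13, and the function names `alignment_profile` and `window_features` are assumptions made here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def alignment_profile(alignment):
    """Return an (L, 20) matrix of per-column amino acid frequencies.

    Gaps and unknown characters are ignored; an all-gap column stays zero.
    """
    length = len(alignment[0])
    counts = np.zeros((length, len(AMINO_ACIDS)))
    for seq in alignment:
        for pos, aa in enumerate(seq):
            if aa in AA_INDEX:
                counts[pos, AA_INDEX[aa]] += 1.0
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def window_features(profile, window=13):
    """Stack the profile rows in a sliding window centered on each residue.

    Positions beyond the chain ends are zero-padded, so the output has
    shape (L, window * 20): one input vector per residue.
    """
    half = window // 2
    padded = np.pad(profile, ((half, half), (0, 0)))
    return np.stack([padded[i:i + window].ravel() for i in range(len(profile))])


# Toy alignment of three hypothetical homologous sequences (assumption).
toy_alignment = ["MKVLA", "MRVLA", "MKIL-"]
profile = alignment_profile(toy_alignment)
features = window_features(profile, window=13)
print(profile.shape, features.shape)
```

Each row of `features` would be the input for predicting the secondary structure state of one residue; a single-sequence network corresponds to the special case where every profile row is one-hot.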

Later networks encode the alignments with more sensitive position-specific scoring matrices obtained from iterated PSI-BLAST searches.

Jones, D.T., 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202

Three-state (Q3) secondary structure prediction accuracies routinely reached about 80% by 2002 (CASP5).

Aloy, P., Stark, A., Hadley, C. and Russell, R.B., 2003. Predictions without templates: new folds, secondary structure, and contacts in CASP5. Proteins, 53, 436-456
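
The accuracy usually quoted for secondary structure prediction is the three-state per-residue accuracy, often called Q3: the fraction of residues whose helix (H), strand (E), or coil (C) label is predicted correctly. A minimal sketch, with made-up example strings and a hypothetical function name `q3_accuracy`:

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose three-state label (H, E, C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Made-up labels for a 13-residue peptide (assumption, for illustration only).
observed = "CHHHHHCCEEEEC"
predicted = "CHHHHCCCEEECC"
print(f"Q3 = {q3_accuracy(predicted, observed):.2f}")
```

In this toy case two of the 13 residues, both at the ends of secondary structure elements, are mislabeled.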

The PSIPRED server gives reliable answers in a few minutes.

http://bioinf.cs.ucl.ac.uk/psipred

Simulations can predict secondary structures as well, but they require much more computational effort, and the results are very sensitive to the choice of force field.

Shell, M.S., Ritterson, R. and Dill, K.A., 2008. A test on peptide stability of AMBER force fields with implicit solvation. J. Phys. Chem. B, 112, 6878-6886

Tertiary structure prediction

Tertiary protein structure prediction accuracy has improved dramatically in recent years.

CASP | GDT
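
CASP accuracy is commonly reported as GDT_TS, the average over the 1, 2, 4, and 8 Å cutoffs of the percentage of Cα atoms that lie within that cutoff of the experimental structure. The sketch below assumes the model has already been superimposed on the reference; the official score additionally searches for the superposition that maximizes the count at each cutoff, which is omitted here. The function name `gdt_ts` and the toy coordinates are assumptions.

```python
import numpy as np


def gdt_ts(model_ca, reference_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS (0-100) for pre-superimposed C-alpha coordinates."""
    model_ca = np.asarray(model_ca, dtype=float)
    reference_ca = np.asarray(reference_ca, dtype=float)
    if model_ca.shape != reference_ca.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # Per-residue deviation between model and reference.
    deviations = np.linalg.norm(model_ca - reference_ca, axis=1)
    # Fraction of residues within each distance cutoff.
    fractions = [np.mean(deviations <= cutoff) for cutoff in cutoffs]
    return 100.0 * float(np.mean(fractions))


# Four toy residues displaced by 0.5, 1.5, 3.0 and 10.0 Angstrom along x.
reference = np.zeros((4, 3))
model = reference.copy()
model[:, 0] = [0.5, 1.5, 3.0, 10.0]
print(f"GDT_TS = {gdt_ts(model, reference):.2f}")
```

Because the large cutoffs tolerate local errors, GDT_TS is less dominated by a few badly placed residues than a global RMSD.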

Homology modeling

Structures are predicted based on sequence-structure similarity.

If we know the structure of one protein, we can generate a good model for another protein with a similar sequence (e.g. human thioredoxin based on the E. coli structure). This requires a good sequence alignment and model-building tools to mutate side chains and add missing parts.

SWISS-MODEL | Modeller

Improved protocols assemble structures from homologous and de novo fragments.

I-TASSER | Rosetta

Physics-based protein folding

Molecular dynamics simulations can fold proteins, but with great computational effort.

Lindorff-Larsen, K., Piana, S., Dror, R.O. and Shaw, D.E., 2011. How fast-folding proteins fold. Science, 334, 517-520

Molecular dynamics simulations can also refine predictions.

Heo, L. and Feig, M., 2018. Experimental accuracy in protein structure refinement via molecular dynamics simulations. PNAS, 115, 13276-13281.

Coevolutionary couplings

Residue contacts are extracted from multiple sequence alignments.
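
The simplest covariation signal between two alignment columns is their mutual information, sketched below on a made-up alignment. This is only an illustration of the principle: practical coupling methods go further and correct for indirect correlations and for phylogenetic and sampling bias, which this sketch does not attempt. The function name `column_mutual_information` is an assumption.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Made-up alignment: columns 0 and 2 change together, column 1 is conserved.
toy_alignment = ["AKD", "AKD", "GKE", "GKE"]
print("MI(0, 2) =", column_mutual_information(toy_alignment, 0, 2))
print("MI(0, 1) =", column_mutual_information(toy_alignment, 0, 1))
```

Column pairs with high scores are candidates for residues in spatial contact, since a mutation at one position tends to be compensated at the other.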

Homology modeling without contact restraints

Modeling with contact restraints

Contact predictions via machine learning

Initial methods were based on statistical inference, but ML methods perform better.

Ma, J., Wang, S., Wang, Z. and Xu, J., 2015. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics, 31, 3506-3513.

Wang, S., Sun, S., Li, Z., Zhang, R. and Xu, J., 2017. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comp. Biol., 13, e1005324.

AlphaFold 1

AlphaFold 1 combined distograms (and backbone torsion angles) predicted from coevolutionary couplings with generative models trained to maximize GDT similarity scores.

Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A.W., Bridgland, A. and Penedones, H., 2019. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins, 87, 1141-1148.

AlphaFold 1 generated actual structures either via a complex simulated annealing protocol or via gradient descent on the predicted potentials, augmented by other potentials (Rosetta etc.).

Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A.W., Bridgland, A. and Penedones, H., 2019. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins, 87, 1141-1148.
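
Generating coordinates by gradient descent on a distance-based potential can be sketched as below. This is not the AlphaFold 1 protocol: a plain harmonic restraint on target distances stands in for the network-derived potentials, and the helix-like reference, noise level, step size, and function names are all assumptions chosen for illustration.

```python
import numpy as np


def restraint_energy_and_gradient(coords, target):
    """Harmonic distance-restraint energy and its gradient.

    E = 1/2 * sum over pairs i<j of (d_ij - target_ij)^2
    """
    diff = coords[:, None, :] - coords[None, :, :]   # (N, N, 3), x_i - x_j
    dist = np.linalg.norm(diff, axis=-1)             # (N, N)
    np.fill_diagonal(dist, 1.0)                      # avoid division by zero
    error = dist - target
    np.fill_diagonal(error, 0.0)                     # no self-restraints
    energy = 0.25 * np.sum(error ** 2)               # each pair appears twice
    gradient = np.sum((error / dist)[:, :, None] * diff, axis=1)
    return energy, gradient


def fold_by_gradient_descent(start, target, step=0.02, n_steps=2000):
    """Plain fixed-step gradient descent on the restraint energy."""
    coords = start.copy()
    for _ in range(n_steps):
        _, gradient = restraint_energy_and_gradient(coords, target)
        coords -= step * gradient
    return coords


# Helix-like toy reference for 10 C-alpha atoms (assumption).
t = np.arange(10)
angle = np.deg2rad(100.0) * t
reference = np.column_stack([2.3 * np.cos(angle), 2.3 * np.sin(angle), 1.5 * t])
target = np.linalg.norm(reference[:, None, :] - reference[None, :, :], axis=-1)

# Start from a perturbed structure and relax it against the target distances.
rng = np.random.default_rng(0)
start = reference + rng.normal(scale=0.5, size=reference.shape)
initial_energy, _ = restraint_energy_and_gradient(start, target)
final = fold_by_gradient_descent(start, target)
final_energy, _ = restraint_energy_and_gradient(final, target)
print(f"energy before: {initial_energy:.3f}, after: {final_energy:.6f}")
```

Distances alone do not fix chirality or the overall position and orientation, so the relaxed structure matches the reference only up to a rigid-body transformation; starting from a random conformation would also expose the local minima that motivate annealing or repeated restarts.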

trRosetta improves over AlphaFold 1 by predicting residue distances and relative orientations.

Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S. and Baker, D., 2020. Improved protein structure prediction using predicted interresidue orientations. PNAS, 117, 1496-1503.

AlphaFold2

AlphaFold2 introduced an attention-based transformer architecture with iterative refinement that directly generates structures (end-to-end prediction).

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.
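
The basic attention operation underlying transformer architectures can be sketched in a few lines. This is generic single-head scaled dot-product self-attention with random weights, not AlphaFold2 code; it omits everything specific to that network, and the dimensions and names are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(features, w_query, w_key, w_value):
    """Single-head scaled dot-product self-attention over residues.

    features: (n_res, d_model) per-residue representation.
    Returns the updated representation and the (n_res, n_res) weights.
    """
    queries = features @ w_query
    keys = features @ w_key
    values = features @ w_value
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ values, weights


# Random toy inputs and projection matrices (assumptions, untrained).
rng = np.random.default_rng(0)
n_res, d_model, d_head = 6, 8, 4
features = rng.normal(size=(n_res, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
updated, weights = self_attention(features, w_q, w_k, w_v)
print(updated.shape, weights.shape)
```

Each residue's updated representation is a weighted mix of all other residues, which lets information flow between positions that are far apart in sequence without a fixed-size window.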

AlphaFold2’s ‘Evoformer’ combines co-evolutionary information from the multiple sequence alignment with structural information (initially from templates, then iteratively from the models that are generated). The architecture uses elements of graph neural networks to maintain equivariance.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.

AlphaFold2’s structure module directly generates structures via an attention-based module trained on actual structures but also augmented with physical constraints. Final models are generated after minimization with Amber (not shown).

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A. and Bridgland, A., 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.

AlphaFold2 models are remarkably accurate at atomistic detail, including for side chains.

Errors are generally around 1 Å RMSD but may be larger in loop regions or in parts of a structure that are in contact with other units, for example due to crystal packing.

Heo, L., Janson, G. and Feig, M., 2021. Physics‐based protein structure refinement in the era of artificial intelligence. Proteins, 89, 1870-1887.
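
RMSD values such as those quoted above are computed after optimal superposition of the model onto the reference. A minimal sketch using the Kabsch algorithm, with toy coordinates and a hypothetical function name `kabsch_rmsd`:

```python
import numpy as np


def kabsch_rmsd(mobile, target):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    # Remove the translation by centering both sets on their centroids.
    mobile_c = mobile - mobile.mean(axis=0)
    target_c = target - target.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    u, _, vt = np.linalg.svd(mobile_c.T @ target_c)
    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    aligned = mobile_c @ rotation
    return float(np.sqrt(np.mean(np.sum((aligned - target_c) ** 2, axis=1))))


# Helix-like toy coordinates (assumption), then a rotated and shifted copy.
t = np.arange(10)
angle = np.deg2rad(100.0) * t
reference = np.column_stack([2.3 * np.cos(angle), 2.3 * np.sin(angle), 1.5 * t])
theta = np.deg2rad(30.0)
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
moved = reference @ rot_z.T + np.array([5.0, -3.0, 2.0])
print("RMSD of rigidly moved copy:", kabsch_rmsd(moved, reference))

# A noisy model: superposition can only lower the RMSD relative to no fitting.
rng = np.random.default_rng(0)
noisy = reference + rng.normal(scale=1.0, size=reference.shape)
print("RMSD of noisy model:", kabsch_rmsd(noisy, reference))
```

Because the fit is global, a few displaced loop residues can raise the RMSD noticeably even when the core is nearly exact, which is one reason to report per-region values as well.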

What happens when AlphaFold2 structures are used in simulations?

Heo, L., Janson, G. and Feig, M., 2021. Physics‐based protein structure refinement in the era of artificial intelligence. Proteins, 89, 1870-1887.

AlphaFold2 is very good at generating one structure, but what if there are multiple functional states?

Heo, L. and Feig, M., 2021. Multi-State Modeling of G-protein Coupled Receptors at Experimental Accuracy. bioRxiv, 2021.11.26.470086

RoseTTAFold is an alternative but conceptually similar approach that provides structures at an accuracy approaching that of AlphaFold2.

Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G.R., Wang, J., Cong, Q., Kinch, L.N., Schaeffer, R.D. and Millán, C., 2021. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373, 871-876.

Quaternary structure prediction

Predictions of complexes have remained a challenge, but much progress is being made:

Bryant, P., Pozzati, G. and Elofsson, A., 2021. Improved prediction of protein-protein interactions using AlphaFold2 and extended multiple-sequence alignments. bioRxiv, 2021.09.15.460468.

Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A.W., Green, T., Žídek, A., Bates, R., Blackwell, S., Yim, J. and Ronneberger, O., 2021. Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2021.10.04.463034.

Are we done?

“At present, for the best cases, the C-alpha coordinate RMSD accuracy of AlphaFold-predicted structures roughly corresponds to the accuracy expected for structures determined at resolutions no better than ∼4 Å. Thus, although structural predictions by AlphaFold and RoseTTAfold may be accurate enough to assist with experimental structure determination, they alone cannot provide the kind of detailed understanding of molecular and chemical interactions that is required for studies of molecular mechanisms and for structure-based drug design.”

and

“… solving the protein-folding problem means making accurate predictions of structures from amino acid sequences starting from first principles based on the underlying physics and chemistry.”

Moore, P.B., Hendrickson, W.A., Henderson, R. and Brunger, A.T., 2022. The protein-folding problem: Not yet solved. Science, 375, 507.
