Week 05 Lab: Structure Prediction
Contents
Week 05 Lab: Structure Prediction¶
Student Name: YOUR NAME HERE¶
Predicting protein structures¶
In this lab we will predict and compare protein structures that were generated via different methods.
As a test system we will predict a recently solved crystal structure. The experimental structure will be our reference agains we can compare the predictions.
Many prediction tools use structural databases based on the PDB that are updated regularly. We will use a structure that was made available just yesterday so that it is less likely that structural databases used by different prediction servers have been updated already to reflect the newly solved structure.
The following three PDBs are good candidates: 7EFY 7TRV 7E90
In the following the notebook contains the results for the first (7EFY).
Your task will be to study this esample and then rerun as much as you can for one of the other targets. The third target (7E90) can probably be done faster and is recommended.
pdbid='7TRV'
pdbid='7EFY'
Setting up a working directory¶
We may want to store all of the data in a separate working directory. This would be useful if we work on multiple targets. Let’s set it up below:
import os
# if we want to store files in current directory uncomment below:
# dir='.'
dir=f'work-{pdbid.lower()}'
if not os.path.isdir(dir):
os.mkdir(dir)
Getting files from the PDB¶
We begin by downloading files and information from the PDB
import requests
import os
def get_pdb_from_rcsb(pdb_id):
response=requests.get(f'http://files.rcsb.org/download/{pdb_id}.pdb')
return(response.text)
def get_uniprot_id_from_rcsb(pdb_id):
query='query={polymer_entity(entry_id:"'+pdb_id+'", entity_id:"1"){uniprots{rcsb_id}}}'
response=requests.get('http://data.rcsb.org/graphql',params=query)
return(response.json()['data']['polymer_entity']['uniprots'][0]['rcsb_id'])
def get_title_from_rcsb(pdb_id):
query='query={entry(entry_id:"'+pdb_id+'"){struct{title}}}'
response=requests.get('http://data.rcsb.org/graphql',params=query)
return(response.json()['data']['entry']['struct']['title'])
def get_fasta_from_uniprot(uniprot_id):
response=requests.get(f'https://www.uniprot.org/uniprot/{uniprot_id}.fasta')
return(response.text)
title=get_title_from_rcsb(pdbid)
uniprot=get_uniprot_id_from_rcsb(pdbid)
pdbfile=f'{dir}/{pdbid}.pdb'
if not os.path.isfile(pdbfile):
contents=get_pdb_from_rcsb(pdbid)
out=open(pdbfile,'w')
out.write(contents)
out.close()
fastafile=f'{dir}/{uniprot}.fasta'
if not os.path.isfile(fastafile):
contents=get_fasta_from_uniprot(uniprot)
out=open(fastafile,'w')
out.write(contents)
out.close()
print(title)
print(f'PDB ID: {pdbid}')
print(f'Uniprot ID: {uniprot}')
print()
with open(fastafile) as f: fasta = f.read()
print(fasta)
Crystal structure of retroviral protease-like domain of Ddi1 from Cryptosporidium hominis
PDB ID: 7EFY
Uniprot ID: A0A0S4TE75
>tr|A0A0S4TE75|A0A0S4TE75_CRYHO UBA domain-containing protein OS=Cryptosporidium hominis OX=237895 GN=CHUDEA3_2190 PE=3 SV=1
MKVNILYSVTGELISVEIHEETTVEVLKSLIEIELNLDSFKNYELSIEGKILQNDGFVLS
DIVNEGSIFVLSEKTNKNTFENSNFSLNDQENLIQLLSKELFTKIKEDQSLKSVYYSKSS
KYRDSIDKDDIKSFTELIKNDYFSGNLNSIPSSSNSSSNLYNLDPLSPEYQRLIEEQVRK
QNVEENLILAQDHLPESFTQVHMLYINAEVNGISIKAFVDSGAQTTIMSKKCAEKCNLVR
LIDYRFSGIAQGVGTSKIVGKIHVAQMKIGNSFFPFSITVLEESHVDFLFGLDLLKRYQC
CIDLHQNALIIGDEKVQFLSESEINSEISRIDPNNEYDQCFEQGKKLQLLSLGFSESQVI
NALRATNGNTELAASLLFSNETFE
Once loaded we can look at the PDB structure
import mdtraj as md
import nglview as nv
pdb=md.load_pdb(pdbfile)
view = nv.NGLWidget(nv.MDTrajTrajectory(pdb))
view.clear_representations()
view.add_cartoon('protein',color_scheme="chainindex")
view.camera='orthographic'
view
Let’s do a bit more analysis of what we have here:
def chaininfo(p):
t=p.topology
for chain in t.chains:
seglist=[]
first=last=-999
for res in chain.residues:
if (first<-100):
first=res.resSeq
else:
if (res.resSeq>last+1):
seglist+=[[first, last]]
first=res.resSeq
last=res.resSeq
seglist+=[[first,last]]
print(f'chain {chain.index} has residues {seglist}, first residue: {chain.residue(0)}')
print(pdb.topology)
chaininfo(pdb)
<mdtraj.Topology with 2 chains, 118 residues, 857 atoms, 865 bonds>
chain 0 has residues [[201, 247], [257, 322]], first residue: VAL201
chain 1 has residues [[401, 405]], first residue: HOH401
So in this case, there is only one protein chain. The second ‘chain’ contains water molecules.
The protein chain covers residues 201-322, with a break between 248-256. So the PDB structure does not cover the entire sequence!
Let’s set up a pdb that contains just one chain and a selection string we will use later for analysis:
refchain=pdb.topology.select("chainid 0")
pdbref=pdb.atom_slice(refchain)
firstres=pdbref.topology.residue(0).resSeq
lastres=pdbref.topology.residue(pdbref.topology.n_residues-1).resSeq
#refresidues="residue -2 to 86"
refresidues="residue 201 to 247 or residue 257 to 322"
Secondary structure prediction¶
It is a good idea to start with a prediction of the secondary structure to get an overall idea of a protein.
Secondary structure prediction traditionally means predicting helices and sheets, but there is additional information about which parts of a sequence may be disordered or correspond to transmembrane helices.
So let’s start with submitting the amino acid sequence (from the FASTA file above) to the PSIPRED server at:
http://bioinf.cs.ucl.ac.uk/psipred
There are many options, some of which take a bit of time. A good start is to request only the following:
PSIPRED 4.0 (secondary structure)
DISOPRED3 (disordered elements)
MEMSAT-SVM (prediction of transmembrane helices)
It should take 5-20 minutes to obtain an answer.
The results will appear in the web browser and you can download a package with the results.
The results for the first system are here
The first view that comes up shows the prediction of transmembrane helices. It predicts one TM helix in the middle of the sequence as shown below (the figure is from the package that can be downloaded from the PSIPRED server):
This would suggest two domains on different sides of a membrane. However the sequence entry in Uniprot does not say anything about it and it may be an incorrect prediction. You may need to check the literature for more information.
To see the secondary structure prediction, click on Show Psipred
.
The following figure should appear (which you can download and save to your ‘work’ directory):
So the structure is mostly ordered with both beta-sheets and alpha-helices. Potentially there are a few disordered segments that could be flexible linkers between different domains.
How consistent are the predictions from PSIPRED with the experimental structure?
Tertiary structure prediction¶
In the rest of the lab we will focus on tertiary structure prediction.
We will show you how to obtain models via the following tools:
AlphaFold2
RoseTTAfold
RaptorX
I-Tasser
Homology modeling via SWISS-MODEL
AlphaFold2 predictions¶
Before you think about making predictions, you should check whether a model has already been generated.
There is a database of AlphaFold2 models available here:
You can download structures for entire proteomes from here:
https://alphafold.ebi.ac.uk/download#proteomes-section
Alphafold Structures are also provided directly in many uniprot entries, e.g. here:
https://www.uniprot.org/uniprot/Q5VSL9
If you cannot find a model, you can run AlphaFold2 by yourself. There are three options:
Install AlphaFold2 on your own computer¶
official distribution with docker:
https://github.com/deepmind/alphafold
modified installations without docker:
https://pythonrepo.com/repo/kuixu-alphafold
https://github.com/kalininalab/alphafold_non_docker
Note that 2TB disk space and a recent GPU card are needed and it may take several days just to download the necessary databases.
Run AlphaFold2 via the DeepMind Colab notebook:¶
https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb
Run AlphaFold2 via an alternative Colab notebook:¶
https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb
All versions should run the same AlphaFold2 model, but they differ in how the multiple sequence alignment is generated and whether an initial template is taken from a template database.
The easiest and fastest way is the last notebook. It uses MMseqs2 for fast alignments and can be run without searching for templates (which often does not make much of a difference).
For the systems we are working on here (200-400 residues) predictions take between 30 minutes to five hours depending on the Runtime that you are connected to.
Notes:
You will need to login to Google with a non-MSU account and connect to a GPU runtime.
You may run out of resources in which case Google will prevent you from running until some time has passed (typically next day).
At the end of an AlphaFold2 run, five models are generated along with some additional analysis. all is packaged up into a zip file and presented for download.
Let’s look what we get for the example:
First, the following graph shows you the multiple sequence coverage as a function of residue.
Better coverage means more reliable contact predictions and more reliable structures for those residues. So in this case we have coverage across the entire sequence, but it is best towards the N-terminus.
The contact maps used in the predictions are shown next:
To assess the quality of the models that are generated, AlphaFold2 predicts lDDT scores as a measure of local residue accuracy. The maximum value is 100 and many parts of the structure reach such values as shown in the next plot. Residues for which the plDDT values are lower indicate that the model is likely less accurate - or that there is a dynamic region for which a single structure makes less sense.
Finally, we look at the models themselves. The models are ranked by the total plDDT scores.
Note: The names of the model PDBs in the downloaded package are different than what is used below (for simplicity). When you generate models yourself, you should either rename the PDBs to match the names below or change the code in the notebook. This also applies to all of the other methods below.
import mdtraj as md
import nglview as nv
def showmodels(mol):
view = nv.NGLWidget()
for n in range(len(mol)):
view.add_trajectory(nv.MDTrajTrajectory(mol[n]))
for n in range(len(mol)):
view[n].clear_representations()
if (len(mol)>0): view[0].add_cartoon('protein',color='red')
if (len(mol)>1): view[1].add_cartoon('protein',color='blue')
if (len(mol)>2): view[2].add_cartoon('protein',color='green')
if (len(mol)>3): view[3].add_cartoon('protein',color='orange')
if (len(mol)>4): view[4].add_cartoon('protein',color='brown')
if (len(mol)>5): view[5].add_cartoon('protein',color='purple')
if (len(mol)>6): view[6].add_cartoon('protein',color='black')
if (len(mol)>7): view[7].add_cartoon('protein',color='magenta')
if (len(mol)>8): view[8].add_cartoon('protein',color='yellow')
if (len(mol)>9): view[9].add_cartoon('protein',color='cyan')
view.camera='orthographic'
return(view)
af=[]
for n in range(5):
affile=f'{dir}/af2_rank_{n+1}.pdb'
af+=[md.load_pdb(affile)]
afnlist=af[0].topology.select(f'residue>={firstres} and residue<={lastres}')
for n in range(len(af)-1):
af[n+1].superpose(af[0],atom_indices=afnlist)
showmodels(af)
The topology is similar, but there is quite a bit of diversity, especially in the C-terminal domain.
We only have the experimental structure for the N-terminal domain, but let’s compare:
def get_rmsd(target,ref,residues):
reflist=ref.topology.select("name CA")
targetlist=target.topology.select("name CA and ("+residues+")")
return(md.rmsd(target,ref,atom_indices=targetlist,ref_atom_indices=reflist))
print(refresidues)
for n in range(len(af)):
rmsd=get_rmsd(af[n],pdbref,refresidues)
print(f'model {n+1}: {rmsd[0]*10} Å')
residue 201 to 247 or residue 257 to 322
model 1: 0.7707220315933228 Å
model 2: 0.6745235621929169 Å
model 3: 0.6987325847148895 Å
model 4: 0.7197705656290054 Å
model 5: 0.6506382673978806 Å
So how good is the best model for the N-terminal domain?
def comparewithpdb(pdb,model,residues):
reflist=pdb.topology.select("name CA")
targetlist=model.topology.select("name CA and ("+residues+")")
smodel=model.superpose(pdb,atom_indices=targetlist,ref_atom_indices=reflist)
view = nv.NGLWidget()
view.add_trajectory(nv.MDTrajTrajectory(pdb))
view.add_trajectory(nv.MDTrajTrajectory(smodel))
view[0].clear_representations()
view[0].add_cartoon('protein',color='red')
view[1].clear_representations()
view[1].add_cartoon('protein',color='blue')
view.camera='orthographic'
return(view)
afnterm=af[0].atom_slice(afnlist)
comparewithpdb(pdbref,afnterm,refresidues)
RoseTTAFold predictions¶
To run RoseTTAfold, you can install it again locally. For more details check here
Alternatively, you can obtain predictions via the Robetta server:
https://robetta.bakerlab.org/submit.php
You have to create an account and login in order to submit predictions but the server is responsive and you can get results relatively quickly.
When submitting a job, there are different options. Make sure you select RoseTTAFold
to get the best predictions. Older methods are available as well if you are interested in comparisons.
At the end five models are generated.
The results for the example are here
On the web site there is additional information about predicated accuracy as a function of residue index for each of the model.
You can download only a single PDB that contains all five models.
The resulting PDB is processed below:
rff=[]
rfile=f'{dir}/rff_models.pdb'
for n in range(5):
rff+=[md.load_pdb(rfile,frame=n)]
rfnlist=rff[0].topology.select(f'residue>={firstres} and residue<={lastres}')
for n in range(len(rff)-1):
rff[n+1].superpose(rff[0],atom_indices=rfnlist)
showmodels(rff)
As you can see, there is again quite a bit of diversity, especially in the C-terminal domain.
Let’s compare again with the N-terminal experimental structure:
print(refresidues)
for n in range(len(rff)):
rmsd=get_rmsd(rff[n],pdbref,refresidues)
print(f'model {n+1}: {rmsd[0]*10} Å')
residue 201 to 247 or residue 257 to 322
model 1: 1.010279357433319 Å
model 2: 1.1361155658960342 Å
model 3: 1.0284408926963806 Å
model 4: 1.0651279985904694 Å
model 5: 1.0534000396728516 Å
rffnterm=rff[0].atom_slice(rfnlist)
comparewithpdb(pdbref,rffnterm,refresidues)
RaptorX predictions¶
RaptorX is available for download here
Otherwise, we can submit a sequence to the server at:
http://raptorx.uchicago.edu/ContactMap/
There are no important options, just the amino acid sequence is needed.
No registration is needed, but once an email is provided, there is an automatic registration process.
It may take a few hours to obtain a prediction depending on how busy the server is.
At the end five models are generated.
The results for the example are here
There is quite a bit information on the web interface and a package with the complete results can be downloaded.
The models are ranked by a scoring function. The models are anlyzed below:
rx=[]
for n in range(5):
rfile=f'{dir}/rx_rank_{n+1}.pdb'
rx+=[md.load_pdb(rfile)]
rxnlist=rx[0].topology.select(f'residue>={firstres} and residue<={lastres}')
for n in range(len(rx)-1):
rx[n+1].superpose(rx[0],atom_indices=rxnlist)
showmodels(rx)
Let’s compare with the experimental structure:
print(refresidues)
for n in range(len(rx)):
rmsd=get_rmsd(rx[n],pdbref,refresidues)
print(f'model {n+1}: {rmsd[0]*10} Å')
residue 201 to 247 or residue 257 to 322
model 1: 1.6687951982021332 Å
model 2: 1.705666333436966 Å
model 3: 1.6815796494483948 Å
model 4: 1.7492686212062836 Å
model 5: 1.8515191972255707 Å
rxnterm=rx[0].atom_slice(rxnlist)
comparewithpdb(pdbref,rxnterm,refresidues)
I-Tasser predictions¶
I-Tasser software can be downloaded from here
The server is available via the following link:
https://zhanggroup.org/I-TASSER/
Registration (not with a .com email address!) is required and only one job can be submitted at a time.
The input is the amino acid sequence, an email address, a password (obtained after registering) . There are additional options to include restraints but for standard structure prediction they are not needed.
The server is quite busy and it may take about a day to obtain results.
At the end five models are generated.
The output for the example is here
The models are ranked by C-score
, a confidence score that is correlated with expected model accuracy. Larger values are better and in this example, none of the C-scores are very good.
There is a package with the results that can be downloaded with a lot of information, including structures of the fragments from threading that were used in the model assembly.
Below, we look at the models:
it=[]
for n in range(5):
rfile=f'{dir}/it_rank_{n+1}.pdb'
it+=[md.load_pdb(rfile)]
itnlist=it[0].topology.select(f'residue>={firstres} and residue<={lastres}')
for n in range(len(it)-1):
it[n+1].superpose(it[0],atom_indices=itnlist)
showmodels(it)
The topology is similar, but there is quite a bit of diversity, especially in the C-terminal domain.
We only have the experimental structure for the N-terminal domain, but let’s compare:
print(refresidues)
for n in range(len(it)):
rmsd=get_rmsd(it[n],pdbref,refresidues)
print(f'model {n+1}: {rmsd[0]*10} Å')
residue 201 to 247 or residue 257 to 322
model 1: 23.021240234375 Å
model 2: 23.025593757629395 Å
model 3: 20.844626426696777 Å
model 4: 1.095193549990654 Å
model 5: 1.092638298869133 Å
itnterm=it[0].atom_slice(itnlist)
comparewithpdb(pdbref,itnterm,refresidues)
That is not so great but there is another model …
itnterm2=it[3].atom_slice(itnlist)
comparewithpdb(pdbref,itnterm2,refresidues)
SWISS-MODEL predictions¶
Finally, we come to homology modeling. The classic software for homology modeling is MODELLER. The software can be installed easily, but to use it, one has to find templates and generate alignments separately, for example via HHpred.
Another way to do homology is via the SWISS-MODEL service:
https://swissmodel.expasy.org/interactive
There are two steps:
Search for possible templates based on PDB structures based on the given sequence
Model building after selecting a template
We want to select the best template, but SWISS-MODEL updates their databases very rapidly and already had the new PDB structures immediately after they were released.
To test homology modeling we have to check the presented templates carefully and select one that is not identical to the experimental structure.
For the example, we choose 4Z2Z as the best template that is listed after the actual PDB structures. Once we have selected a template, we can build a model.
It is possible to build multiple models based on multiple templates. Homology modeling is fast and results are available quickly.
An alternative is to build models based on a user-provided template via the same link. In this case, we have to provide a PDB structure. With the large number of AlphaFold2 models, it is an interesting option to use one of those models from the AlphaFold database as a template. You can find suitable templates based on sequence via this interface:
https://www.ebi.ac.uk/Tools/sss/fasta/
You will have to go to the Structures
tab and select AlphaFold DB
.
For the example, the closest template is AFDB:AF-Q8IM03-F1
The results for the example based on 4Z2Z are here
The result for the example based on the AlphaFold2 template structure are here
As with other methods, there is an estimate of model quality and the generated model can be downloaded (under the Model 01
tab on the left).
Let’s look first at the homology model based on the closest template from the PDB (4Z2Z):
rfile=f'{dir}/swissmodel_hm.pdb'
smhm=[md.load_pdb(rfile)]
showmodels(smhm)
The homology model was built as a dimer because the template is a dimer. Let’s continue with just the first chain.
chaininfo(smhm[0])
smhm[0]=smhm[0].atom_slice(smhm[0].topology.select("chainid 0"))
chain 0 has residues [[189, 325]], first residue: LEU189
chain 1 has residues [[188, 326]], first residue: ILE188
print(refresidues)
for n in range(len(smhm)):
rmsd=get_rmsd(smhm[n],pdbref,refresidues)
print(f'model {n+1}: {rmsd[0]*10} Å')
residue 201 to 247 or residue 257 to 322
model 1: 1.5669961273670197 Å
smhmnlist=smhm[0].topology.select(f'residue>={firstres} and residue<={lastres}')
smhmnterm=smhm[0].atom_slice(smhmnlist)
comparewithpdb(pdbref,smhmnterm,refresidues)
Finally let’s look at the homology model based on the closest AF2 template (AF_Q8IM03):
rfile=f'{dir}/swissmodel_af.pdb'
smaf=[md.load_pdb(rfile)]
showmodels(smaf)
print(refresidues)
for n in range(len(smaf)):
rmsd=get_rmsd(smaf[n],pdbref,refresidues)
print(f'model {n+1}: {rmsd[0]*10} Å')
residue 201 to 247 or residue 257 to 322
model 1: 0.9636662155389786 Å
smafnlist=smaf[0].topology.select(f'residue>={firstres} and residue<={lastres}')
smafnterm=smaf[0].atom_slice(smafnlist)
comparewithpdb(pdbref,smafnterm,refresidues)
Here is the analysis for all of the models together:
print(refresidues)
for n in range(len(af)):
rmsd=get_rmsd(af[n],pdbref,refresidues)
print(f'AlphaFold2 model {n+1}: {rmsd[0]*10} Å')
for n in range(len(rff)):
rmsd=get_rmsd(rff[n],pdbref,refresidues)
print(f'RoseTTAFold model {n+1}: {rmsd[0]*10} Å')
for n in range(len(rx)):
rmsd=get_rmsd(rx[n],pdbref,refresidues)
print(f'Raptor X model {n+1}: {rmsd[0]*10} Å')
for n in range(len(it)):
rmsd=get_rmsd(it[n],pdbref,refresidues)
print(f'I-Tasser model {n+1}: {rmsd[0]*10} Å')
for n in range(len(smhm)):
rmsd=get_rmsd(smhm[n],pdbref,refresidues)
print(f'SWISS-MODEL (PDB) model {n+1}: {rmsd[0]*10} Å')
for n in range(len(smaf)):
rmsd=get_rmsd(smaf[n],pdbref,refresidues)
print(f'SWISS-MODEL (AF) model {n+1}: {rmsd[0]*10} Å')
residue 201 to 247 or residue 257 to 322
AlphaFold2 model 1: 0.7707220315933228 Å
AlphaFold2 model 2: 0.6745235621929169 Å
AlphaFold2 model 3: 0.6987325847148895 Å
AlphaFold2 model 4: 0.7197705656290054 Å
AlphaFold2 model 5: 0.6506382673978806 Å
RoseTTAFold model 1: 1.010279357433319 Å
RoseTTAFold model 2: 1.1361155658960342 Å
RoseTTAFold model 3: 1.0284408926963806 Å
RoseTTAFold model 4: 1.0651279985904694 Å
RoseTTAFold model 5: 1.0534000396728516 Å
Raptor X model 1: 1.6687951982021332 Å
Raptor X model 2: 1.705666333436966 Å
Raptor X model 3: 1.6815796494483948 Å
Raptor X model 4: 1.7492686212062836 Å
Raptor X model 5: 1.8515191972255707 Å
I-Tasser model 1: 23.021240234375 Å
I-Tasser model 2: 23.025593757629395 Å
I-Tasser model 3: 20.844626426696777 Å
I-Tasser model 4: 1.095193549990654 Å
I-Tasser model 5: 1.092638298869133 Å
SWISS-MODEL (PDB) model 1: 1.5669961273670197 Å
SWISS-MODEL (AF) model 1: 0.9636662155389786 Å
best=[pdbref]
reflist=pdbref.topology.select("name CA")
list=af[0].topology.select("name CA and ("+refresidues+")")
best+=[af[0].superpose(pdbref,atom_indices=list,ref_atom_indices=reflist)]
list=rff[0].topology.select("name CA and ("+refresidues+")")
best+=[rff[0].superpose(pdbref,atom_indices=list,ref_atom_indices=reflist)]
list=rx[0].topology.select("name CA and ("+refresidues+")")
best+=[rx[0].superpose(pdbref,atom_indices=list,ref_atom_indices=reflist)]
list=it[0].topology.select("name CA and ("+refresidues+")")
best+=[it[0].superpose(pdbref,atom_indices=list,ref_atom_indices=reflist)]
list=smhm[0].topology.select("name CA and ("+refresidues+")")
best+=[smhm[0].superpose(pdbref,atom_indices=list,ref_atom_indices=reflist)]
list=smaf[0].topology.select("name CA and ("+refresidues+")")
best+=[smaf[0].superpose(pdbref,atom_indices=list,ref_atom_indices=reflist)]
showmodels(best)
How do your predictions look like?
Can you think of more ways in which the models can be analyzed/compared?
Which model you would select and which parts you would trust?