Homology Modeling of Proteins (Sequence to Structure)


Homology modeling is used to predict the 3D-structure of a unknown protein based on the known structure of a similar protein. During evolution, sequence changes much faster than structure. It is possible to identify the 3D-structure by looking at a molecule with some sequence identity. Figure 1 shows how much sequence identity is needed with a certain number of aligned residues to reach the safe homology modeling zone. For a sequence of 100 residues, for example, a sequence identity of 40% is sufficient for structure prediction. When the sequence identity falls in the safe homology modeling zone, we can assume that the 3D-structure of both sequences is the same. 



Figure 1: We can assume that two proteins share the same 3D-structure when their sequence identity falls in the safe homology zone, the upper part of the picture.  (Figure taken from: http://www.cmbi.ru.nl/gvteach/astra/lectures/homology_modelling.ppt


The known structure is called the template, the unknown structure is called the target. 
Homology modeling of the target structure can be done in 7 steps:

1: Template recognition and initial alignment 
In this step you compare the sequence of the unknown structure with all the known structures stored in the Protein Data Bank (PDB). A search with BLAST against this database will give a list of known protein structures that match the sequence. If BLAST cannot find a template a more sophisticated technique might be necessary to identify the structure of a molecule.
BLAST uses a residue exchange matrix to define a score for every hit. Residues that are easily exchanged (for example Ile to Leu) get a better score than residues that have different properties (for example Glu to Trp). Conserved residues with a specific function get the best score (for example Cys to Cys). Every hit is scored using this matrix and BLAST will provide a list of possible templates for the uknown structure. To make the best initial alignment, BLAST uses an alignment-matrix based on the residue exchange matrix and adds extra penalties for opening and extension of a gap between residues. In practice the target-sequence is sent to a BLAST server, which searches the PDB to obtain a list of possible templates and their alignments. Subsequently the best hit has to be chosen, which is not necessarily the first one. One has to keep in mind the resolution, missing parts, different states of action and possible ligands of the molecules in doing so.

2: Alignment correction
It is possible that the alignment has to be corrected. A change of Ala to Glu is possible but unlikely to happen in a hydrophobic core, so this Ala and Glu cannot be aligned. Using a multiple sequence alignment program (ClustalW) the residues and properties that have to be conserved can be found. By looking at the template structure it will become clear which residues are in the core and are less likely to be changed than the residues at the outside. Insertions and deletions can be made in widely divergent parts of the molecule and a multiple sequence alignment can be helpful to find these places. Gaps have to be shifted around until they are as small as possible. In Figure 2 is shown that after a deletion of 3 residues a big gap occurs in the red structure, which was the best alignment. After shifting several residues, the gap is much smaller (blue structure) and more likely to be correct. Correction of the alignment is typically done by hand.





Figure 2: Template structure (green) with the best aligned target (red) with a large gap, and the target after shifting several residues (blue). The gap is much smaller now. (Figure taken from: http://www.cmbi.ru.nl/gvteach/astra/lectures/homology_modelling.ppt)
3: Backbone generation
When the alignment is correct, the backbone of the target can be created. The coordinates of the template-backbone are copied to the target. When the residues are identical, the side-chain coordinates are also copied. Because a PDB-file can always contain some errors, it can be useful to make use of multiple templates.

4: Loop modeling
Often the alignment will contain gaps as a result of deletions and insertions. When the target sequence contains a gap, one can simply delete the corresponding residues in the template. This creates a hole in the model, this has already been discussed in step 2. When there is an insertion in the target, shown in Figure 3, the template will contain a gap and there are no backbone coordinates known for these residues in the model. The backbone from the template has to be cut to insert these residues. Such large changes cannot be modeled in secondary structure elements and therefore have to be placed in loops and strands. Surface loops are, however, flexible and difficult to predict. One way to handle loops is to take some residues before and after the insertion as "anchor" residues and search the PDB for loops with the same anchor-residues. The best loop is simply copied in the model. This is shown in Figure 4. The two residues which are colored green in Figure 4 are used as anchor, the best loop with the inserted resisdues was found in the database and placed in the model.





Figure 3: Target sequence (green) with insertion (grey box) results in a gap in the template.
(Figure taken from http://www.itb.uni-stuttgart.de/training/bioinformatics02/GROUP6/session_10.html)







 Figure 4: The red loop is modeled with the green residues as anchor residues. The insertion of 2 residues results in a longer loop. 
(Figure taken from http://www.itb.uni-stuttgart.de/training/bioinformatics02/GROUP6/session_10.html)

5: Side-chain modeling


Now it is time to add side-chains to the backbone of the model. Conserved residues were already copied completely. The torsion angle between C-alpha and C-beta of the other residues can also be copied to the model because these rotamers tend to be conserved in similar proteins. It is also possible to predict the rotamer because many backbone configurations strongly prefer a specific rotamer. As shown in Figure 5, the backbone of this tyrosine strongly prefers two rotamers and the real side-chain fits in one of them. There are libraries based upon the backbone of the residues flanking the residue of interest. By using these libraries the best rotamer can be predicted. This last method is used by Yasara.



Figure 5: Prefered rotamers of this tyrosin (colored sticks) the real side-chain (cyan) fits in one of them. 

6: Model optimization
The model has to be optimized because the changed side-chains can effect the backbone, and a changed backbone will have effect on the predicted rotamers. Optimization can be done by performing refinements using Molecular Dynamics simulations of the model. The model is placed in a force-field and the movements of the molecules are followed in time, this mimics the folding of the protein. The big errors like bumps will be removed but new smaller errors can be introduced. The calculated energy should be as low as possible.




Figure 6: Crambin in a simulation cell during a molecular dynamics simulation.
7: Model validation
Every model contains errors. The model with the lowest forcefield energy might still be folded completely wrong. That is why the model should be checked for bumps and if the bond angles, torsion angles and bond lenghts are within normal ranges. Other properties, like the distribution of polar/apolar residues, can be compared with real structures. This can be done by using the check-options in WHATIF. The output can help in the identification of errors in the model. When an error occurs far away from the active site, it does not have to be bad. But when an error occurs in the active site, one should reconsider the template and/or alignment.

In practice the majority of these steps can be done by the program WHATIF Yasara Twinset. This program needs a target sequence with predicted secondary structure and a template. Both can be obtained by sending the sequence to the online psipred-server. The first time this server should predict the secondary structure (PSIPRED v2.4), the second time the server should perform a fold recognition (mGenTHREADER - with profiles and predicted secondary structure). With the output from this server, Yasara can make an alignment and start building the model. For different chains in one model the procedure is slightly different. Now the alignment has to be made by hand and can be used as pir-file in Yasara. The secondary structure prediction is still needed. Yasara will generate different candidate models. Refinements and molecular dynamics simulations will be performed automatically on these and the scores are compared. The best model is kept. Validation checks are done by WHATIF.