Unit III
Sequencing Alignment and Dynamic Programming
Alignment-Local,Global alignment,pairwise and multiple sequence alignments.
Pairwise local/global alignment - Introduction
The EMBOSS-Align tool can be used for pairwise alignment of sequences. There are three types of alignments we can consider when aligning sequnces, optimal, global and local alignments.
Optimal alignments
The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments. There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry? All sequence alignment procedures make some such assumptions.
Global alignment
An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing. A tiny example:
NLGPSTKDFGKISESREFDNQ
| |||| |
QLNQLERSFGKINMRLEDALV
Local alignment
An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:
NLGPSTKDDFGKILGPSTKDDQ
||||
QNQLERSSNFGKINQLERSSNN
It may seem that one should always use local alignments. However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment, so global alignment is useful in some cases. You can produce a global or a local alignment with the Emboss Pairwise global and local alignment tool.
EMBOSS-Align is the tool we will use to compare 2 sequences:
The EMBOSS-Align tool contains two programs each using a different algorithm:
When you want an alignment that covers the whole length of both sequences, use the needle program.
When you are trying to find the best region of similarity between two sequences, use the water program.
The following is a slightly more deatiled explanation of the two programs, needle and water, used in EMBOSS-Align:
Needle program - This program uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length.
The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment in the order of mn steps, (where 'n' and 'm' are the lengths of the two sequences). These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.
What is the optimal alignment? Dynamic programming methods ensure the optimal global alignment by exploring all possible alignments and choosing the best. It does this by reading in a scoring matrix that contains values for every possible residue or nucleotide match. Needle finds an alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix.
An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty).
Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.
This is a true implementation of the Needleman-Wunsch algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you've a lot of memory and a lot of time.
Needle is for aligning two sequences over their entire length. This works best with closely related sequences. If you use needle to align very distantly-related sequences, it will produce a result but much of the alignment may have little or no biological significance.
A true Needleman Wunsch implementation like needle needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of integers and one of floats so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.
Water program - Water uses the Smith-Waterman algorithm (modified for speed enhancements) to calculate the local alignment.
The Smith-Waterman algorithm is a member of the class of algorithms that can calculate the best score and local alignment in the order of mn steps, (where 'n' and 'm' are the lengths of the two sequences). These dynamic programming algorithms were first developed for protein sequence comparison by Smith and Waterman, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.
A local alignment searches for regions of local similarity between two sequences and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.
Dynamic programming methods ensure the optimal local alignment by exploring all possible alignments and choosing the best. It does this by reading in a scoring matrix that contains values for every possible residue or nucleotide match. Water finds an alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix.
An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty).
Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.
Water is a true implementation of the Smith-Waterman algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you have a lot of memory and a lot of time.
Local alignment methods only report the best matching areas between two sequences - there may be a large number of alternative local alignments that do not score as highly. If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, it may be possible to "miss" the second and subsequent alignments. You will be able to see this situation if you have done a dotplot and your local alignment does not show all the features you expected to see.
Water is for aligning the best matching subsequences of two sequences. It does not necessarily align whole sequences against each other; you should use needle if you wish to align closely related sequences along their whole lengths.
A true Smith Waterman implementation like water needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of integers and one of floating point (real) numbers so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.
Running an EMBOSS-Align alignment |
>myosin heavy chain 2A, skeletal muscle - rat kakkaitdaa mmaeelkkeq dtsahlermk knmeqtvkdl qlrldeaeql alkggkkqiq klearvrele geveseqkrn veavkglrkh errvkeltyq teedrknilr lqdlvdklqa kvksykrqae eaeeqsntnl skfrklqhel eeaeeradia esqvnklrvk srevhtkvis ee |
Fragment of human myosin heavy chain 2A |
>myosin heavy chain 2A,skeletal muscle - human elldaservq llhtqntsli ntkkkletdi sqmqgemedi lqearnaeek akkaitdaam maeelkkeqd tsahlermkk nmeqtvkdlq lrldeaeqla lkggkkqiqk learvreleg eveseqkrna eavkglrkhe rrvkeltyqt eedrknilrl qdlvdklqak vksykrqaee aeeqsntnla kfrklqhele eaeeradiae sqvnklrvks revhtkvise e |
- The 2 sequences were entered in FASTA format, which consists of a one-line header starting with a ">" symbol, followed by the sequence name/description. The sequence is then entered on new line(s) and terminated with a line consisting of "//".
- 2 jobs were run, one with the EMBOSS global alignment program (needle), and one with the local alignment program (water).
- As we are comparing 2 protein sequences, the molecule type was left on protein.
- The default blosum62 matrix was used, and the default gap open of "10" and gap extend of "0.5" were also used.
|
Results of EMBOSS-Align alignmentsNote that matching amino acids are connected with a "|" symbol. Mismatches would be connected with a space. A gap would be represented with a "-" symbol. Similar amino acids (e.g. leucine vs methionine) are connected via a "." symbol. Thus a sequence alignment can be represented in the format...
DMFCNTEGQGIAMM | |||| .. TMG--NEGQGSETT The %id is the percentage of identical matches between the two sequences over the reported aligned region. The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater or equal to 0.0. The Overall %id and Overall %similarity are calculated in a similar manner for the number of matches over the length of the longest of the two sequences.
Global Alignment Results
Matrix used Blosum62
Gap penalties used
Gap open 10.0
Gap extend 0.5 |
######################################## # Program: needle # Rundate: Mon Jan 27 11:42:35 2003 # Align_format: srspair # Report_file: /ebi/extserv/old-work/87881043667742.needle.res ######################################## #======================================= # # Aligned_sequences: 2 # 1: myosin # 2: myosin # Matrix: EBLOSUM62 # Gap_penalty: 10.0 # Extend_penalty: 0.5 # # Length: 231 # Identity: 180/231 (77.9%) # Similarity: 181/231 (78.4%) # Gaps: 49/231 (21.2%) # Score: 878.0 # # #======================================= myosin 1 k 1 | myosin 1 elldaservqllhtqntslintkkkletdisqmqgemedilqearnaeek 50 myosin 2 akkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeqla 51 |||||||||||||||||||||||||||||||||||||||||||||||||| myosin 51 akkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeqla 100 myosin 52 lkggkkqiqklearvrelegeveseqkrnveavkglrkherrvkeltyqt 101 |||||||||||||||||||||||||||||.|||||||||||||||||||| myosin 101 lkggkkqiqklearvrelegeveseqkrnaeavkglrkherrvkeltyqt 150 myosin 102 eedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlskfrklqhele 151 |||||||||||||||||||||||||||||||||||||||:|||||||||| myosin 151 eedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlakfrklqhele 200 myosin 152 eaeeradiaesqvnklrvksrevhtkvisee 182 ||||||||||||||||||||||||||||||| myosin 201 eaeeradiaesqvnklrvksrevhtkvisee 231 #--------------------------------------- #--------------------------------------- |
Local Alignment Results
Matrix used Blosum62
Gap penalties used
Gap open 10.0
Gap extend 0.5
|
######################################## # Program: water # Rundate: Mon Jan 27 11:41:59 2003 # Align_format: srspair # Report_file: /ebi/extserv/old-work/86951043667707.water.res ######################################## #======================================= # # Aligned_sequences: 2 # 1: myosin # 2: myosin # Matrix: EBLOSUM62 # Gap_penalty: 10.0 # Extend_penalty: 0.5 # # Length: 182 # Identity: 180/182 (98.9%) # Similarity: 181/182 (99.5%) # Gaps: 0/182 ( 0.0%) # Score: 878.0 # # #======================================= myosin 1 kakkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeql 50 |||||||||||||||||||||||||||||||||||||||||||||||||| myosin 50 kakkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeql 99 myosin 51 alkggkkqiqklearvrelegeveseqkrnveavkglrkherrvkeltyq 100 ||||||||||||||||||||||||||||||.||||||||||||||||||| myosin 100 alkggkkqiqklearvrelegeveseqkrnaeavkglrkherrvkeltyq 149 myosin 101 teedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlskfrklqhel 150 ||||||||||||||||||||||||||||||||||||||||:||||||||| myosin 150 teedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlakfrklqhel 199 myosin 151 eeaeeradiaesqvnklrvksrevhtkvisee 182 |||||||||||||||||||||||||||||||| myosin 200 eeaeeradiaesqvnklrvksrevhtkvisee 231 #--------------------------------------- #--------------------------------------- |
The results of the local and global alignments in this case are very similar, as the sequences are short. |
Concept of gap penalty and e-value.Alignment algorithms.
(Refer the following site for this material)
Dynamic programming in sequence alignment:Needleman-Wunsch Algorithm and Smith Waterman Algorithm,
(Refer the following site for this material)
Aminoacid Substitution matrices (PAM,BLOSUM).
Introduction
It is assumed that the sequences being sought have an evolutionary ancestral sequence in common with the query sequence. The best guess at the actual path of evolution is the path that requires the fewest evolutionary events. All substitutions are not equally likely and should be weighted to account for this. Insertions and deletions are less likely than substitutions and should be weighted to account for this. It is necessary to consider that the choice of search algorithm influences the sensitivity and selectivity of the search. The choice of similarity matrix determines both the pattern and the extent of substitutions in the sequences the database search is most likely to discover.
There have been extensive studies looking at the frequencies in which amino acids substituted for each other during evolution. The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family. Each phylogenetic tree can then be examined for the substitutions found on each branch. This can then be used to produce tables(scoring matrices) of the relative frequencies with which amino acids replace each other over a short evolutionary period. Thus a substitution matrix describes the likelihood that two residue types would mutate to each other in evolutionary time.
A substitution is more likely to occur between amino acids with similar biochemical properties. For example the hydrophobic amino acids Isoleucine(I) and valine(V) get a positive score on matrices adding weight to the likeliness that one will substitute for another. While the hydrophobic amino acid isoleucine has a negative score with the hydrophilic amino acid cystine(C) as the likeliness of this substitution occurring in the protein is far less. Thus matrices are used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.
Guidelines for using matricies
Protein Query Length | Matrix | Open Gap | Extend Gap |
>300 | BLOSUM50 | -10 | -2 |
85-300 | BLOSUM62 | -7 | -1 |
50-85 | BLOSUM80 | -16 | -4 |
>300 | PAM250 | -10 | -2 |
85-300 | PAM120 | -16 | -4 |
35-85 | MDM40 | -12 | -2 |
<=35 | MDM20 | -22 | -4 |
<=10 | MDM10 | -23 | -4 |
Importance of scoring matrices
- Scoring matrices appear in all analysis involving sequence comparison.
- The choice of matrix can strongly influence the outcome of the analysis.
- Scoring matrices implicitly represent a particular theory of evolution.
- Understanding theories underlying a given scoring matrix can aid in making proper choice.
Types of matrices
Differences between PAM and BLOSSUM
- PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the Blosum matrices are based on an implicit rather than explicit model of evolution.
- The sequence variability in the alignments used to count replacements. The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The Blosum matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
- The method used to count the replacements is different, unlike the PAM matrix, the Blosum procedure uses groups of sequences within which not all mutations are counted the same.
Equivalent PAM and Blossum matrices
The following matrices are roughly equivalent...
- PAM100 ==> Blosum90
- PAM120 ==> Blosum80
- PAM160 ==> Blosum60
- PAM200 ==> Blosum52
- PAM250 ==> Blosum45
Generally speaking...
- The Blosum matrices are best for detecting local alignments.
- The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
- The Blosum45 matrix is the best for detecting long and weak alignments.
PAM (Point Accepted Mutation) matrix
Amino acid scoring matrices are traditionally PAM (Point Accepted Mutation) matrices which refer to various degrees of sensitivity depending on the evolutionary distance between sequence pairs. In this manner PAM40 is most sensitive for sequences 40 PAMs apart. PAM250 is for more distantly related sequences and is considered a good general matrix for protein database searching. For nucleotide sequence searching a simpler approach is used which either convert a PAM40 matrix into match/mismatch values which takes into consideration that a purine may be replaced by a purine and a pyrimidine by a pyrimidine.
e.g. The PAM 250 matrix
This is appropriate for searching for alignments of sequence that have diverged by 250 PAMs, 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations this corresponds to sequences that are about 20 percent identical.
In this example Isoleucine(I) is likely to be substituted by valine(V) and gets a score of 4. Isoleucine(I) is unlikely to be substituted for Cystine and gets a score of -2.
BLOSSUM (Blocks Substitution Matrix)The BLOSUM matrices, also used for protein database search scoring (the default in blastp), are divided into statistical significance degrees which, in a way, are reminiscent of PAM distances. For example, BLOSUM64 is roughly equivalent to PAM 120. BLOSSUM Blocks Substitution Matrix). BLOSSUM matrices are most sensitive for local alignment of related sequences. The BLOSUM matrices are therefore ideal when tying to identify an unknown nucleotide sequence.
e.g. Blosum 45 Matrix
This is derived from sequence blocks clustered at the 45% identity level.
In this example Isoleucine(I) is likely to be substituted by valine(V) and gets a score of 3. Isoleucine(I) is unlikely to be substituted for Cystine and gets a score of -3.
Summary
These 2 matrices both generally perform well, but give slightly different results. The Blosum matrices have often been the better performers, reflecting the fact that the Blosum matrices are based on the replacement patterns found in more highly conserved regions of the sequences. This seems to be an advantage as these more highly conserved regions are those discovered in database searches and they serve as anchor points in alignments involving complete sequences. It is expected that the replacements that occur in more highly conserved regions will be more restricted than those that occur in highly variable regions of the sequence. This is supported by the different pattern of positive and negative scores in the two families of matrices. These different patterns of positive and negative scores reflect different estimates of what constitute conservative and non conservative substitutions in the evolution of proteins. These differences reflect the differences in constructing the two families of matrices. Some of the difference is also likely to be because the Blosum matrices are based on much more data than the PAM matrices. The PAM matrices still perform quite well despite the small amount of data underlying them. The most likely reasons for this are the care used in constructing the alignments and phylogenetic trees used in counting replacements and the fact that they are based on a simple model of evolution and thus they still perform better than some of the more modern matrices that are less carefully constructed.
GONNET Matrix
A different method to measure differences among amino acids was developed by Gonnet, Cohen and Benner (1992) using exhaustive pairwise alignments of the protein databases as they existed at that time. They used classical distance measures to estimate an alignment of the proteins. They then used this data to estimate a new distance matrix. This was used to refine the alignment, estimate a new distance matrix and so on iteratively. They noted that the distance matrices (all first normalised to 250 PAMs) differed depending on whether they were derived from distantly or closely homologous proteins. They suggest that for initial comparisons their resulting matrix should be used in preference to a PAM250 matrix, and that subsequent refinements should be done using a PAM matrix appropriate to the distance between proteins.
Here a you only get a positive score for a match, and a score of -10000 for a mismatch. As such a high penalty is given for a mismatch, no substitution should be allowed, although a gap may be permitted.
PAM matrices
PAM matrices are amino acid substitution matrices that encode the expected evolutionary change at the amino acid level. Each PAM matrix is designed to compare two sequences which are a specific number of PAM units apart. For example - the PAM120 score matrix is designed to compare between sequences that are 120 PAM units apart: The score it gives a pair of sequences is the (log of the) probabilities of such sequences evolving during 120 PAM units of evolution. For any specific pair (Ai, Aj) of amino acids the (i,j) entry in the PAM n matrix reflects the frequency at which Ai is expected to replace with Aj in two sequences that are n PAM units diverged. These frequencies should be estimated by gathering statistics on replaced amino acids.
Collecting statistics about amino acids substitution in order to compute the PAM matrices is relatively difficult for sequences that are distantly diverged, as mentioned in the previous section. But for sequences that are highly similar, i.e., the PAM divergence distance between them is small, finding the position correspondence is relatively easy since only few insertions and deletions took place. Therefore, in the first stage statistics were collected from aligned sequences that were believed to be approximately one PAM unit diverged and the PAM1 matrix could be computed based on this data, as follows: Let Mij denote the observed frequency (= estimated probability) of amino acid Ai mutating into amino acid Aj during one PAM unit of evolutionary change. M is a 20x real matrix, with the values in each matrix column adding up to 1. There is a significant variance between the values in each column.
Figure: The top left corner 5x5 of the PAM1 matrix. We write 104Mij for convinience. |
|
Once M is known, the matrix Mn gives the probabilities of any amino acid mutating to any other during n PAM units. The (i,j) entry in the PAM n matrix is therefore:
where f(i) and f(j) are the observed frequencies of amino acids Ai and Aj respectively. This approach assumes that the frequencies of the amino acids remain constant over time, and that the mutational processes causing substitutions during an interval of one PAM unit operate in the same manner for longer periods. We take the log value of the probability in order to allow computing the total score of all substitutions using summation rather than multiplication. The PAM matrix is usually organized by dividing the amino acids to groups of relatively similar amino acids and all group members are located in consecutive columns in the matrix.
BLOSUM - BLOcks SUbstitution Matrix
The BLOSUM matrix is another amino acid substitution matrix, first calculated by Henikoff and Henikoff [5]. For its calculation only blocks of amino acid sequences with small change between them are considered. These blocks are called conserved blocks (See figure 3.2). One reason for this is that one needs to find a multiple alignment between all these sequences and it is easier to construct such an alignment with more similar sequences. Another reason is that the purpose of the matrix is to measure the probability of one amino acid to change into another, and the change between distant sequences may include also insertions and deletions of amino acids. Moreover, we are more interested in conservation of regions inside protein families, where sequences are quite similar, and therefore we restrict our examination to such.
Figure 3.2: Alignment of several sequences. The conserved blocks are marked. |
|
The first stage of building the BLOSUM matrix is eliminating sequences, which are identical in more than x% of their amino acid sequence. This is done to avoid bias of the result in favor of a certain protein. The elimination is done either by removing sequences from the block, or by finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster. The matrix built from blocks with no more the x% of similarity is called BLOSUM-x (e.g. the matrix built using sequences with no more then 50% similarity is called BLOSUM-50.)
The second stage is counting the pairs of amino acids in each column of the multiple alignment. For example in a column with the acids AABACA (as in the first column in the block in figure 3.2), there are 6 AA pairs, 4 AB pairs, 4 AC, and one BC. The probability qi,j for a pair of amino acids in the same column to be Ai and Aj is calculated, as well as the probability pi of a certain amino acid to be Ai.
In the third stage the log odd ratio is calculated as . As final result we consider the rounded 2si,j, this value is stored in the (i,j) entry of the BLOSUM-x matrix.
In contrast to the PAM matrices, more sequences are examined in the process of computing the BLOSUM matrix. Moreover, the sequences are of specific nature of resemblance, and therefore the two sets of matrices differ.
Comparing the efficiency of two matrices is done by calculating the ratio between the number of pairs of similar sequences discovered by a certain matrix but not discovered by another one and the number of pairs missed by the first but found by the other. According to this comparison BLOSUM-62 is found to be better than other BLOSUM-x matrices as well as than PAM-x matrices.
Sequence similarity search with database:BLAST and FASTA.
The statistics of global sequence comparison
Unfortunately, under even the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores. Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these can not be generalized easily. Therefore, one of the few methods available for assessing the statistical significance of a particular global alignment is to generate many random sequence pairs of the appropriate length and composition, and calculate the optimal alignment score for each. While it is then possible to express the score of interest in terms of standard deviations from the mean, it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown. The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01. One further pitfall to avoid is exaggerating the significance of a result found among multiple tests. When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly. An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
The statistics of local sequence comparison
Fortunately statistics for the scores of local alignments, unlike those of global alignments, are well understood. This is particularly true for local alignments lacking gaps, which we will consider first. Such alignments were precisely those sought by the original BLAST database search programs.
A local alignment without gaps consists simply of a pair of equal length segments, one from each of the two sequences being compared. A modification of the Smith-Waterman or Sellers algorithms will find all segment pairs whose scores can not be improved by extension or trimming. These are called high-scoring segment pairs or HSPs.
To analyze how high a score is likely to arise by chance, a model of random sequences is needed. For proteins, the simplest model chooses the amino acid residues in a sequence independently, with specific background probabilities for the various residues. Additionally, the expected score for aligning a random pair of amino acid is required to be negative. Were this not the case, long alignments would tend to have high score independently of whether the segments aligned were related, and the statistical theory would break down.
Just as the sum of a large number of independent identically distributed (i.i.d) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution . (We will elide the many technical points required to make this statement rigorous.) In studying optimal local sequence alignments, we are essentially dealing with the latter case . In the limit of sufficiently large sequence lengthsm and n, the statistics of HSP scores are characterized by two parameters, K and lambda. Most simply, the expected number of HSPs with score at least S is given by the formula
We call this the E-value for the score S.
This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score. The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.
Bit scores
Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years. By normalizing a raw score using the formula.
one attains a "bit score" S', which has a standard set of units. The E-value corresponding to a given bit score is simply
Bit scores subsume the statistical essence of the scoring system employed, so that to calculate significance one needs to know in addition only the size of the search space.
P-values
The number of random HSPs with score >= S is described by a Poisson distribution. This means that the probability of finding exactly a HSPs with score >=S is given by:
where E is the E-value of S given by equation (1) above. Specifically the chance of finding zero HSPs with score >=S is e-E, so the probability of finding at least one such HSP is.
This is the P-value associated with the score S. For example, if one expects to find three HSPs with score >= S, the probability of finding at least one is 0.95. The BLAST programs report E-value rather than P-values because it is easier to understand the difference between, for example, E-value of 5 and 10 than P-values of 0.993 and 0.99995. However, when E < 0.01, P-values and E-value are nearly identical.
Database searches
The E-value of equation (1) applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths? One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database. Recent versions of the FASTA protein comparison programs take this approach.
An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues. Examining equation (1), this can be accomplished simply by treating the database as a single long sequence of length N. The BLAST programs take this approach to calculating database E-value. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.
The statistics of gapped alignments
The statistics developed above have a solid theoretical foundation only for local alignments that are not permitted to have gaps. However, many computational experiments and some analytic results strongly suggest that the same theory applies as well to gapped alignments. For ungapped alignments, the statistical parameters can be calculated, using analytic formulas, from the substitution scores and the background residue frequencies of the sequences being compared. For gapped alignments, these parameters must be estimated from a large-scale comparison of "random" sequences.
Some database search programs, such as FASTA or various implementation of the Smith-Waterman algorithm, produce optimal local alignment scores for the comparison of the query sequence to every sequence in the database. Most of these scores involve unrelated sequences, and therefore can be used to estimate lambda and K . This approach avoids the artificiality of a random sequence model by employing real sequences, with their attendant internal structure and correlations, but it must face the problem of excluding from the estimation scores from pairs of related sequences. The BLAST programs achieve much of their speed by avoiding the calculation of optimal alignment scores for all but a handful of unrelated sequences. The must therefore rely upon a pre-estimation of the parameters lambda and K, for a selected set of substitution matrices and gap costs. This estimation could be done using real sequences, but has instead relied upon a random sequence model, which appears to yield fairly accurate results.
Edge effects
The statistics described above tend to be somewhat conservative for short sequences. The theory supporting these statistics is an asymptotic one, which assumes an optimal local alignment can begin with any aligned pair of residues. However, a high-scoring alignment must have some length, and therefore can not begin near to the end of either of two sequences being compared. This "edge effect" may be corrected for by calculating an "effective length" for sequences; the BLAST programs implement such a correction. For sequences longer than about 200 residues the edge effect correction is usually negligible.
The content of this page was mainly stolen from http://www.psc.edu/biomed/dist-ed/ and http://www.ncbi.nlm.nih.gov/BLAST/tutorial/
BLAST: A simplification of Smith-Waterman
The BLAST algorithm (20) uses a word based heuristic similar to that of FASTA to approximate a simplification of the Smith-Waterman algorithm known as the maximal segment pairs algorithm. Maximal segment pairs alignments do not allow gaps and are specified by an equation that includes only the first and fourth terms of the Smith-Waterman equation presented above. Maximal segment pair alignments have the very valuable property that their statistics are well understood (15). Thus, we can readily compute a significance probability for a maximal segment pair alignment. Recent advances in maximal segment pairs statistics allow the use of several independent segment alignments to be used in evaluating the significance probability.
The price for being able to readily compute a significance probability is that the alignments can not have gaps (15). Thus, the evolutionary model requires that there be a fairly long stretch of sequence that has evolved without insertions or deletions, or at least with a complimentary pattern of insertions and deletions that do not significantly disrupt the alignment.
The BLAST algorithm is less sensitive than Smith-Waterman and in appropriate circumstances more selectivity. For proteins the BLAST word based heuristic is more sensitive than the FASTA heuristic even though BLAST uses a word size of three for proteins while FASTA uses a word size of two. However, BLAST uses a very long word size, eleven, for nucleic acid sequences. Also the modifications to the heuristic that improve the sensitivity for protein sequences do not work as well for nucleic acid sequences because have only a four letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides. Thus, FASTA is more sensitive than BLAST for nucleic acid sequences and should used instead of BLAST.
The BLAST word based heuristic uses a default word size of three for proteins and eleven for nucleic acid sequences. The tables below illustrate how the BLAST heuristic differs from the FASTA heuristic using a word size of two applied to a short protein query sequence. A word size of two is used in the example to keep it to a managable size.
Creating a Word List for BLAST, Word Size = 2
Adipokinetic hormone II - Migratory locust
q l n f s a g w
q l
l n
n f
f s
s a
a g
g w
In the BLAST heuristic this list of words is expanded in order to recover the sensitivity lost by matching only identical words. Any word that scores at least a minimum threshold when aligned with any of the initial list of words is added to the list. BLAST then examines the database sequences for words that exactly match any of the expanded list. The expanded below to a total list of 47 words derived from the initial 7. The expanded list contains any word that scores at least eight when aligned with the initial word and scored with the PAM 120 similarity table.
Initial
Word Expanded List
ql: ql, qm, hl, zl
ln: ln, lb
nf: nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf
fs: fs, fa, fn, fd, fg, fp, ft, fb, ys
sa: no words score 8 or more, including the initial word sa
ag: ag
gw: gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw,
tw, vw,bw, zw, xw
One noteworthy feature of the table above is that there is no word that scores eight or more when aligned with the initial word "sa". This situation does occur in actual BLAST searches. The user has the option to force the initial word into the final list. The default is not to include such low scoring words because they contain so little information that they are unlikely to be critical in finding the maximal segment pair alignment. In practice, BLAST has used a word length of 3 for protein database searches with a required threshold score of 13 using the Blosum62 similarity scoring matrix.
Recent Developments in BLAST: More Sensitivity and Gaps
The inventors of the BLAST algorithm have continued to look for ways to improve both the sensitivity and the speed of their algorithm (23). As we will see later, the growth of the sequence databases has raised the miniimum score and hence the length of alignment that must be found by BLAST for a match to be significant. For these longer alignments both the speed and the sensitivity of BLAST can be improved by requiring that the algorithm find two matches above some (lower) threshold rather than one match. Both matches must be on the same diagonal. The increase in speed results from improved specificity so that fewer sequences are completely evaluated In practice BLAST now looks for two words of length 3 that each score at least 11 using the Blosum62 similarity scoring matrix. The two matches must be within 40 amino acids of each other on the same diagonal.
Another recent improvement in the BLAST algorithm is to provide a rigorous gapped alignment as part of the results. The first steps that BLAST made towards providing gapped alignments was to find a number of high scoring segments in addition to the maximal scoring segment. A later version used a strategy similar to that found in FASTA. In this strategy the maximal scoring segment is used to define a band in the path graph and the Smith-Waterman (17) algorithm is used to find a gapped alignment that lays entirely within the band. However, as noted in the discussion of FASTA, this strategy can produce a less than optimal Smith-Waterman alignment if the number of gaps needed in the best gapped alignment is greater than the width of the band.
The new gapped BLAST gets around this problem and still avoids the high computational cost of a full Smith-Waterman alignment on the pair of sequences by building the alignment out from a central high scoring pair of aligned amino acids in a way analogous to BLAST extends the initial maximal segment pair alignment. The initial pair of aligned amino acids is chosen as the middle pair of the highest scoring window of 11 amino acids in the high scoring segment pair alignments. The Smith-Waterman alignment is then extended in all directions in the path graph until it falls below a fixed percentage of the highest score yet computed in the Smith-Waterman phase.
This method of computing the Smith-Waterman gapped alignment will always find the best scoring Smith-Waterman alignment if two conditions are met. First, the calculation is extended until a lower bounds of a score of zero is reached. Using a higher threshold for stopping the calculation accepts a very small risk of not finding the complete alignment in return for a very large savings in computer resources. The second criterion that must be met is that the initial pair of amino acids selected as the midpoint from which to extend the alignment must actually be part of the alignment that would be reported as the best alignment by a complete Smith-Waterman alignment of the pair of sequences. Again the selection criteria used by gapped BLAST appear to be very effective in selecting an appropriate pair of amino acids from which to extend the alignment.
FASTA: A fast approximation to Smith-Waterman
FASTA is a two step algorithm (19). The first step is a search for highly similar seqments in the two sequences. In this search a word with a specific word size is used to find regions in a two-dimensional table table similar to that shown for the Smith-Waterman algorithm. . These regions are a diagonal or a few closely spaced diagonals in the table which have a high number of identical word matches between the sequences. The second step is a Smith-Waterman alignment centered on the diagonals that correspond to the alignment of the highly similar sequence segments. The region for the Smith-Waterman alignment is bounded by a window size. The window size limits the number of insertions or deletions one sequence can accumulate with respect to the other sequence in the alignment. Thus, the significant speedups in observed in a FASTA search relative to a full Smith-Waterman search is due to the prior restriction in alignment space as well as skipping the Smith-Waterman step when no diagonals are found that correspond to aolignments between highly similar sequence segments.
The FASTA algorithm is a heuristic approximation to the Smith-Waterman algorithm. The heuristics used by FASTA allow it to run much faster that the Smith-Waterman algorithm but at the cost of some sensitivity. Two heuristics are employed. Both can be interpreted as restrictions on the model of sequence evolution that is used in comparing the sequences. The first restriction is implemented by the word size parameter, usually two for proteins and six for nucleic acids. This means that FASTA constrains the evolution between a pair of sequences to preserve a number of unchanged dipeptides or hexanucleotides. The effect of this word size restriction can be seen in the dot plots shown below.
In the dot plots below each asterisks marks the beginning of a string of matches in the two short nucleotide sequence shown at the edges of the dot plot. The length of the matches thus varies between one and three nucleotides. As can be seen with a word size of one the dot plot is so noisy that it is difficult to pick out any region that might correspond to an alignment that accurately reflects homology between the sequences. However, when we shift to a word size of two the region more or less in the center of the dot plot stands out as offering the most possibilities for a long alignment with a high degree of sequence identity. In cases like this the loss of sensitivity resulting from increasing the word size has also resulted in a worthwhile increase in selectivity. In the final dot plot, with a word size of three, not only has most of the noise in the dot plot been removed but so has most of the information indicating the regions involved in the best alignment between the two sequences. Given the same scoring parameters the best alignment should be the same alignment found by the Smith-Waterman algorithm and shown above.
Dot Plot: Word Size = 1
g c t g g a a g g c a t
g *
* * * *
c * *
a *
* *
g * * * * *
a * *
*
g * * * * *
c * *
a
* * *
c * *
t
* *
Dot Plot: Word Size = 2
g c t g g a a g g c a t
g *
*
c *
a
*
g *
a *
g *
*
c *
a
c *
t
Dot Plot: Word Size = 3
g c t g g a a g g c a t
g
*
c
a
g
a
g *
c
a
c
t
The first step in the FASTA algorithm is to divide the query sequence into its constituent overlapping words of length two for proteins or six for nucleic acids. This process is shown in the table below. Then as each sequence is read from the database it is also divided into its constituent overlapping words. These two list of words are compared to find the identical words in both sequences. An initial score is computed based on the number of identities concentrated within small regions of the dot plot. If this initial score is high enough a second score is computed by evaluating which of the initial identities can be joined into a consistent alignment using only gaps of less than some maximum length. This maximum lengths for gaps is set by the window size parameter. Finally, if the secondary score is high enough then a Smith-Waterman alignment is performed within the region of the dot plot defined by the concentrated identities and the window size. This third score is reported as the optimal score.
Creating a Word List for FASTA, Word Size = 6
g c t g g a a g g c a t
g c t g g a
c t g g a a
t g g a a g
g g a a g g
g a a g g c
a a g g c a
a g g c a t
The weaknesses in the FASTA approach can be shown in two pathological examples. The first example would have two proteins that share 50% identity - but the proper alignment consists of alternating match and mismatches. With a word size of two, there would be no word matches along the main diagonal of the dot plot for the sequences (although there will potentially be random or spurious word matches on the off-diagonals) and the proper alignment would probably not be found. The second case consists of two proteins that are almost identical, except the second protein has a 20 residue insertion into the middle of the sequence. If the window size is 15, then the Smith-Waterman alignment phase of FASTA will not have enough alignment space to jump the 20 residue insertion. Thus, the resulting alignment will be either the sequence prior to or after the insertion (whichever had the higher diagonal scores) and the fact that the proteins were basically identical (with only one long insertion) will be missed.
The window size discussed in the previous paragraph is the second heuristic used by FASTA. Its effect is more variable than that of the word size. If the best alignment lies entirely within the window defined by the window size and the concentrated identities there is no effect. However if the best alignment, as defined by a full Smith-Waterman analysis, goes outside the window then a lower scoring alignment will be found by FASTA. This can lead the user to conclude that the sequences are not homologous when in fact they are and homology could have been inferred from a full Smith-Waterman alignment.
In practice these pathological cases are very unlikely. However, similar cases do occur and the loss of sensitivity caused by the use of these heuristics will be seen in the examples presented at the end of the tutorial.