SYLLABUS OF BIOINFORMATICS (AS PER JNTUH, HYDERABAD, AP, INDIA) FOR BTECH BIOTECHNOLOGY AND BTECH INFORMATION TECHNOLOGY COURSES 2011-2012


Unit-I  

Introduction to Bioinformatics
Scope of Bioinformatics, Elementary Commands and Protocols, FTP, Telnet, HTTP, Primer on information theory

Unit II

Special topics in Bioinformatics
DNA mapping and sequencing, map alignment. Large scale sequencing methods, shotgun and Sanger methods, cDNA sequencing, genome mapping, map assembly, comparative sequence analysis.

Unit III
Sequencing Alignment and Dynamic Programming
Alignment - local and global alignment, pairwise and multiple sequence alignments. Concept of gap penalty and e-value. Alignment algorithms. Dynamic programming in sequence alignment: Needleman-Wunsch algorithm and Smith-Waterman algorithm. Amino acid substitution matrices (PAM, BLOSUM). Sequence similarity search with databases: BLAST and FASTA.

Unit IV

Primary database Information
Introduction to Biological databases, organization and management of databases, searching and retrieval of information from the World Wide Web. Structure databases - PDB (Protein Data Bank), Molecular Modeling Database (MMDB). Primary databases: NCBI, EMBL, DDBJ.

Unit V

Secondary database
Introduction to Biological databases, organization and management of databases: Swiss-Prot, PIR, KEGG

Unit VI

Phylogenetic analysis and tree building
Introduction to phylogenetics, methods of phylogenetic analysis, role of multiple sequence alignment algorithms in phylogenetic analysis, automated tools for phylogenetic analysis, construction of phylogenetic trees.

Unit VII

Biochemical database
Introduction to Biological databases, organization and management of databases: KEGG, ExPASy, BRENDA, ERGO


Unit VIII

Introduction to Homology
Introduction to Homology, levels of protein structure, homology modeling of proteins (sequence to structure); Cn3D, RasMol and SPDBV in homology modeling - case studies

UNIT I INTRODUCTION TO BIOINFORMATICS


Introduction to Bioinformatics

Bioinformatics is the application of computer technology to the management of biological information. Computers are used to gather, store, analyze and integrate biological and genetic information which can then be applied to gene-based drug discovery and development. The need for Bioinformatics capabilities has been precipitated by the explosion of publicly available genomic information resulting from the Human Genome Project. The goal of this project - determination of the sequence of the entire human genome (approximately three billion base pairs) - was achieved in 2003. The science of Bioinformatics, which is the melding of molecular biology with computer science, is essential to the use of genomic information in understanding human diseases and in the identification of new molecular targets for drug discovery. In recognition of this, many universities, government institutions and pharmaceutical firms have formed bioinformatics groups, consisting of computational biologists and bioinformatics computer scientists. Such groups will be key to unraveling the mass of information generated by large scale sequencing efforts underway in laboratories around the world.


Scope of Bioinformatics


Bioinformatics Applications: Genomics, Proteomics and Transcriptomics

Bioinformatics is a combination of molecular biology and computer science: a technology in which computers are used to gather, store, analyze and integrate biological and genetic information. The need for Bioinformatics arose when a project to determine the sequence of the entire human genome was initiated; this project was called the Human Genome Project. Bioinformatics is very important for the use of genomic information to understand human diseases and to identify new ways for gene-based drug discovery and development. Therefore, many universities, government institutions and pharmaceutical companies have formed bioinformatics groups to carry out research in computational biology and to make these processes more efficient and less time-consuming.


Bioinformatics in Proteomics

Proteomics is a branch of biotechnology that applies the techniques of molecular biology, biochemistry, and genetics to analyze the structure, function, and interactions of the proteins produced by the genes of a particular cell, tissue, or organism. This technology is being improved continuously and new tactics are being introduced, and it is now possible to acquire proteome data on a large scale. Bioinformatics makes it easier to come up with new algorithms to handle large and heterogeneous data sets and to improve these processes. To date, algorithms for image analysis of 2D gels have been developed, and in the case of mass spectrometry, data analysis algorithms for peptide mass fingerprinting and peptide fragmentation fingerprinting have been developed.

Bioinformatics and Genomics

Genomics is the study of complex sets of genes, their expression and the vital roles they play in biology. The most important application of bioinformatics in genomics is the Human Genome Project, through which more than 30,000 genes were identified by sequencing the chemical base pairs which make up DNA. It has thus enabled us to obtain necessary knowledge as to how these genes inter-relate and what functions they perform. Cures for many diseases are being discovered through this inter-relation, where bioinformatics, no doubt, plays a pivotal role.

Bioinformatics and Transcriptomics

Transcriptomics deals with the study of the messenger RNA molecules produced in an individual or a population of a particular cell type. It is also referred to as expression profiling, in which the expression level of mRNA in a given cell population is determined through DNA microarray technology. Bioinformatics is thus used for transcriptome analysis, where mRNA expression levels can be determined so as to see how a certain disease, like cancer, can be treated.

Parallel to the above-mentioned fields, bioinformatics is also being used in the following areas, all of which are quite debatable in their own capacity and will be discussed in further detail:
  • Molecular medicine
  • Personalised medicine
  • Preventative medicine
  • Gene therapy
  • Drug development
  • Microbial genome applications
  • Waste cleanup
  • Climate change studies
  • Alternative energy sources
  • Biotechnology
  • Antibiotic resistance
  • Forensic analysis of microbes
  • Bio-weapon creation
  • Evolutionary studies
  • Crop improvement
  • Insect resistance
  • Improved nutritional quality
  • Development of drought-resistant varieties
  • Veterinary science



Elementary Commands and Protocols: FTP, Telnet, HTTP


File eXchange Protocol (FXP, also known as FXSP) is a method of data transfer which uses the FTP protocol to transfer data from one remote server to another (inter-server) without routing the data through the client's connection. Conventional FTP involves a single server and a single client; all data transmission is done between these two. In an FXP session, a client maintains a standard FTP connection to two servers, and can direct either server to connect to the other to initiate a data transfer. The advantage of using FXP over FTP is evident when a high-bandwidth server demands resources from another high-bandwidth server, but only a low-bandwidth client, such as a network administrator working away from location, has the authority to access the resources on both servers.

Transport Layer Security (TLS) and its predecessor, Secure Sockets Layer (SSL), are cryptographic protocols that provide security and data integrity for communications over networks such as the Internet. TLS and SSL encrypt the segments of network connections at the Transport Layer end-to-end.

The HyperText Transfer Protocol (HTTP) is the standard application-level protocol used for exchanging files on the World Wide Web. HTTP runs on top of the TCP/IP protocol. Web browsers are HTTP clients that send file requests to Web servers, which in turn handle the requests via an HTTP service. HTTP was originally proposed in 1989 by Tim Berners-Lee, who was a coauthor of the 1.0 specification. HTTP in its 1.0 version was "stateless": each new request from a client established a new connection instead of handling all similar requests through the same connection between a specific client and server. Version 1.1 includes persistent connections, decompression of HTML files by client browsers, and multiple domain names sharing the same IP address.

Telnet (teletype network) is a network protocol used on the Internet or local area networks to provide a bidirectional interactive communications facility. Typically, telnet provides access to a command-line interface on a remote host via a virtual terminal connection which consists of an 8-bit byte oriented data connection over the Transmission Control Protocol (TCP). User data is interspersed in-band with Telnet control information. The term telnet may also refer to the software that implements the client part of the protocol. Telnet client applications are available for virtually all computer platforms. Most network equipment and operating systems with a TCP/IP stack support a Telnet service for remote configuration (including systems based on Windows NT). Because of security issues with Telnet, its use has waned in favor of SSH for remote access. The Secure File Transfer Protocol (SFTP) is a network protocol that provides file transfer and manipulation functionality over any reliable data stream.

It is typically used with version two of the SSH protocol (TCP port 22) to provide secure file transfer, but is intended to be usable with other protocols as well. Whereas SCP allows only file transfers, the SFTP protocol allows for a range of operations on remote files - it is more like a remote file system protocol. An SFTP client's extra capabilities compared to an SCP client include resuming interrupted transfers, directory listings, and remote file removal. [1] For these reasons it is relatively simple to implement a GUI SFTP client compared with a GUI SCP client. SMTP (Simple Mail Transfer Protocol) is a protocol for sending e-mail messages between servers. Most e-mail systems that send mail over the Internet use SMTP to send messages from one server to another; the messages can then be retrieved with an e-mail client using either POP or IMAP. In addition, SMTP is generally used to send messages from a mail client to a mail server.

This is why you need to specify both the POP or IMAP server and the SMTP server when you configure your e-mail application. IMAP (Internet Message Access Protocol) is a protocol for retrieving e-mail messages. The latest version, IMAP4, is similar to POP3 but supports some additional features. For example, with IMAP4, you can search through your e-mail messages for keywords while the messages are still on the mail server. You can then choose which messages to download to your machine. POP, short for Post Office Protocol, is a protocol used to retrieve e-mail from a mail server. Most e-mail applications (sometimes called e-mail clients) use the POP protocol, although some can use the newer IMAP (Internet Message Access Protocol).
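As a concrete illustration of this division of labour, the following minimal Python sketch hands a message to an SMTP server for delivery; the hosts, addresses and credentials are hypothetical placeholders, and the recipient would later fetch the message over POP3 or IMAP.

import smtplib
from email.message import EmailMessage

# Compose a simple message (addresses and hosts below are hypothetical).
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Protocol demo"
msg.set_content("Sent via SMTP; fetched later with POP3 or IMAP.")

# Hand the message to the outgoing (SMTP) server for delivery.
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()                      # upgrade to an encrypted channel
    server.login("alice", "app-password")  # authenticate to the submission server
    server.send_message(msg)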



FTP


File Transfer Protocol (FTP) is a standard network protocol used to transfer files from one host to another host over a TCP-based network, such as the Internet. It is often used to upload web pages and other documents from a private development machine to a public web-hosting server. FTP is built on a client-server architecture and uses separate control and data connections between the client and the server.[1] FTP users may authenticate themselves using a clear-text sign-in protocol, normally in the form of a username and password, but can connect anonymously if the server is configured to allow it. For secure transmission that hides (encrypts) your username and password, as well as encrypts the content, you can try using a client that uses SSH File Transfer Protocol.
The first FTP client applications were interactive command-line tools, implementing standard commands and syntax. Graphical user interfaces have since been developed for many of the popular desktop operating systems in use today,[2][3] including general web design programs like Microsoft Expression Web, and specialist FTP clients such as CuteFTP.
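A minimal sketch of an FTP session using Python's standard ftplib module, as a command-line counterpart to the graphical clients just mentioned; the server name and file are hypothetical, and the anonymous login works only if the server is configured to allow it.

import ftplib

# Connect to a (hypothetical) server and sign in anonymously.
ftp = ftplib.FTP("ftp.example.org")
ftp.login()                        # anonymous login; pass user/password otherwise

ftp.cwd("/pub")                    # change to a remote directory
ftp.retrlines("LIST")              # print a directory listing

# Download a file; FTP transfers data over a separate data connection.
with open("readme.txt", "wb") as fh:
    ftp.retrbinary("RETR readme.txt", fh.write)

ftp.quit()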

Telnet

Telnet is a network protocol used on the Internet or local area networks to provide a bidirectional interactive text-oriented communications facility using a virtual terminal connection. User data is interspersed in-band with Telnet control information in an 8-bit byte oriented data connection over the Transmission Control Protocol (TCP).
Telnet was developed in 1969 beginning with RFC 15, extended in RFC 854, and standardized as Internet Engineering Task Force (IETF) Internet Standard STD 8, one of the first Internet standards.
Historically, Telnet provided access to a command-line interface (usually, of an operating system) on a remote host. Most network equipment and operating systems with a TCP/IP stack support a Telnet service for remote configuration (including systems based on Windows NT). Because of security issues with Telnet, its use for this purpose has waned in favor of SSH.
The term telnet may also refer to the software that implements the client part of the protocol. Telnet client applications are available for virtually all computer platforms. Telnet is also used as a verb. To telnet means to establish a connection with the Telnet protocol, either with command line client or with a programmatic interface. For example, a common directive might be: "To change your password, telnet to the server, login and run the passwd command." Most often, a user will be telnetting to a Unix-like server system or a network device (such as a router) and obtain a login prompt to a command line text interface or a character-based full-screen manager.
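A minimal sketch of "telnetting" to a host, echoing the passwd example above, with Python's standard telnetlib module (long standard, though removed from very recent Python versions); the host, credentials and prompt strings are hypothetical.

from telnetlib import Telnet

HOST = "router.example.net"   # hypothetical network device

with Telnet(HOST, 23, timeout=10) as tn:
    tn.read_until(b"login: ")       # wait for the remote login prompt
    tn.write(b"admin\n")
    tn.read_until(b"Password: ")
    tn.write(b"secret\n")
    tn.write(b"passwd\n")           # e.g. run the passwd command from the text
    tn.write(b"exit\n")
    print(tn.read_all().decode("ascii", errors="replace"))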

HTTP

The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems.[1] HTTP is the foundation of data communication for the World Wide Web. Hypertext is a multi-linear set of objects, building a network by using logical links (the so called hyperlinks) between the nodes (e.g. text or words). HTTP is the protocol to exchange or transfer hypertext.
The standards development of HTTP was coordinated by the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C), culminating in the publication of a series of Requests for Comments (RFCs), most notably RFC 2616 (June 1999), which defines HTTP/1.1, the version of HTTP in common use.

HTTP functions as a request-response protocol in the client-server computing model. In HTTP, a web browser, for example, acts as a client, while an application running on a computer hosting a web site functions as a server. The client submits an HTTP request message to the server. The server, which stores content, or provides resources, such as HTML files, or performs other functions on behalf of the client, returns a response message to the client. A response contains completion status information about the request and may contain any content requested by the client in its message body.
A web browser (or client) is often referred to as a user agent (UA). Other user agents can include the indexing software used by search providers, known as web crawlers, or variations of the web browser such as voice browsers, which present an interactive voice user interface.

HTTP is designed to permit intermediate network elements to improve or enable communications between clients and servers. High-traffic websites often benefit from web cache servers that deliver content on behalf of the original, so-called origin server, to improve response time. HTTP proxy servers at network boundaries facilitate communication when clients without a globally routable address are located in private networks by relaying the requests and responses between clients and servers.
HTTP is an Application Layer protocol designed within the framework of the Internet Protocol Suite. The protocol definitions presume a reliable Transport Layer protocol for host-to-host data transfer.[2] The Transmission Control Protocol (TCP) is the dominant protocol in use for this purpose. However, HTTP has found application even with unreliable protocols, such as the User Datagram Protocol (UDP) in methods such as the Simple Service Discovery Protocol (SSDP).

HTTP resources are identified and located on the network by Uniform Resource Identifiers (URIs) - or, more specifically, Uniform Resource Locators (URLs) - using the http or https URI schemes. URIs and the Hypertext Markup Language (HTML) form a system of inter-linked resources, called hypertext documents, on the Internet, which led to the establishment of the World Wide Web in 1990 by English computer scientist and innovator Tim Berners-Lee.

The original version of HTTP (HTTP/1.0) was revised in HTTP/1.1. HTTP/1.0 uses a separate connection to the same server for every request-response transaction, while HTTP/1.1 can reuse a connection multiple times, to download, for instance, images for a just delivered page. Hence HTTP/1.1 communications experience less latency as the establishment of TCP connections presents considerable overhead.
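The benefit of a persistent connection can be seen with Python's standard http.client module: below, one TCP connection carries two request-response exchanges. The host name is a placeholder.

import http.client

# One TCP connection, reused for two HTTP/1.1 request-response exchanges.
conn = http.client.HTTPConnection("example.com")

conn.request("GET", "/")            # first request on the connection
resp = conn.getresponse()
print(resp.status, resp.reason)     # completion status, e.g. "200 OK"
body = resp.read()                  # drain the body before reusing the socket

conn.request("GET", "/index.html")  # second request reuses the same connection
resp = conn.getresponse()
print(resp.status, len(resp.read()), "bytes")

conn.close()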




Primer on information theory

(Refer the following URL for this topic)
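Although the referenced URL is not reproduced here, the central quantity of such a primer is Shannon's entropy, H = -sum(p_i * log2(p_i)), the average number of bits needed per symbol. A minimal sketch applied to DNA (at most 2 bits per base, since there are four bases):

from collections import Counter
from math import log2

def shannon_entropy(seq):
    """H = -sum(p * log2(p)) over symbol frequencies: bits per symbol."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(shannon_entropy("ACGTACGTACGT"))  # 2.0 bits: all four bases equally likely
print(shannon_entropy("AAAAAAAAAACG"))  # about 0.82: highly predictable sequence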

UNIT II SPECIAL TOPICS IN BIOINFORMATICS


Special topics in Bioinformatics


DNA mapping and sequencing


DNA restriction mapping is often the first operation done to characterize some unknown DNA. Special enzymes called restriction enzymes cut the DNA molecule at every occurrence of a specified nucleotide sequence. The places where the molecule is cut are called restriction sites, and the pieces of DNA created as a result are called restriction fragments. The lengths of these fragments can be easily measured by gel electrophoresis. The problem of DNA restriction mapping is to reconstruct the positions of the restriction sites given the lengths of all restriction fragments. In a partial digest, each cut is performed with a probability lower than 1, so having billions of DNA molecules it is very likely that fragments corresponding to all pairwise distances between sites are obtained. This kind of problem is called the partial digest problem.

Figure 1. Partial digest problem [3].

Let X be a set of n points on the axis, and let ΔX be the multiset containing all the pairwise distances between points from X. Given a multiset L of pairwise distances, we need to reconstruct X such that ΔX = L. One can see that this problem is non-unique: translating the set X along the axis or reflecting it does not change the pairwise distances. For simplicity we will assume that the molecule starts at position 0 and the restriction sites are situated on the positive half-axis, so the n-th point from X corresponds to the biggest number M contained in L. However, this does not remove the non-uniqueness entirely.

The simplest approach to solving the partial digest problem is exhaustive search. The idea is to generate all possible n-element sets {0, x2, ..., xn-1, M} such that each xi belongs to L and 0 < x2 < ... < xn-1 < M. We then treat each set as X and calculate ΔX; if ΔX = L, a solution has been found. The main advantage of the exhaustive search approach is its simplicity. However, the complexity of the algorithm is far too high to use it in practice, which is why a different algorithm using a branch and bound approach has been invented. The idea is as follows:
  1. Make X = {0, M} and remove the number M from L.
  2. Find the biggest remaining number α in L. We know that a restriction site can be either at position y = α (leftmost) or y = M - α (rightmost). Assume first that the restriction site is at the leftmost position.
  3. Calculate the distances between y and all elements of X. This multiset is denoted Δ(y,X).
  4. If L contains Δ(y,X), then set L = L - Δ(y,X), add y to X, and go to step 2 (or output a solution if L is empty). Otherwise there are two possibilities:
A. If only the leftmost position has been checked so far, assume y = M - α and go to step 3.
B. If both the leftmost and rightmost positions have already been checked, backtrack the last iteration and try again.

The time complexity of this algorithm in the worst case is exponential. However, in the average case the algorithm performs quite well, and it is very often used in practical problems.
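The branch-and-bound steps above translate almost line by line into Python; a minimal sketch (function names are ours, and the multiset bookkeeping uses collections.Counter):

from collections import Counter

def delta(y, X):
    """Multiset of distances between candidate point y and the points in X."""
    return Counter(abs(y - x) for x in X)

def partial_digest(L):
    """Return all point sets X (with 0 and M fixed) whose pairwise distances equal L."""
    L = Counter(L)
    M = max(L)                       # step 1: the molecule width
    L[M] -= 1
    solutions = []
    _place(L, [0, M], M, solutions)
    return solutions

def _place(L, X, M, solutions):
    if not +L:                       # unary '+' drops zero counts; empty L => done
        solutions.append(sorted(X))
        return
    alpha = max(+L)                  # step 2: biggest remaining distance
    for y in {alpha, M - alpha}:     # leftmost and rightmost candidate positions
        D = delta(y, X)              # step 3
        if all(L[d] >= c for d, c in D.items()):   # step 4: does L contain D?
            L.subtract(D)
            X.append(y)
            _place(L, X, M, solutions)
            X.pop()                  # step 4B: backtrack
            L.update(D)

print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
# finds {0, 2, 4, 7, 10} and its mirror image {0, 3, 6, 8, 10}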

Another problem strictly related to DNA is sequencing, which consists in determining the sequence of nucleotides forming a DNA strand. The first method of DNA sequencing, called the chain termination method [1], was invented by Sanger and works well on sequences containing at most a few hundred nucleotides. To sequence a larger sample of DNA, one can use a method called shotgun sequencing. A long DNA strand is broken randomly into smaller pieces, which are sequenced with Sanger's method. This procedure is repeated multiple times, so many overlapping fragments (called reads) are obtained. One can see that those reads are substrings of a longer genomic string. Hence, DNA sequencing can be reduced to the shortest common superstring problem. The main disadvantage of this approach is that the shortest common superstring problem is NP-complete, so greedy heuristics must be used to solve the problem in an efficient way [6].


Figure 2. Shotgun DNA sequencing method [4].
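The greedy heuristic mentioned above repeatedly merges the two reads with the largest suffix-prefix overlap until one string remains. A compact Python sketch (quadratic and purely illustrative; real assemblers use far more elaborate data structures):

def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Repeatedly merge the pair of reads with the biggest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, bi, bj = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, bi, bj = k, i, j
        merged = reads[bi] + reads[bj][best_k:]     # glue the best pair
        reads = [r for n, r in enumerate(reads) if n not in (bi, bj)]
        reads.append(merged)
    return reads[0]

print(greedy_superstring(["ACGTAC", "GTACGA", "ACGAGC"]))   # ACGTACGAGC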

The rapid development of DNA microarrays resulted in a new sequencing technique called sequencing by hybridization. A DNA microarray contains short DNA fragments called probes. These probes can hybridize with the DNA strand being sequenced (for example, an ACT probe will hybridize with the complementary TGA sequence). Hence, we can detect the presence of a given subsequence. The structure of an example microarray is shown in figure 3.

Figure 3. Example microarray made by Affymetrix [5].

Let's assume that the microarray contains all possible nucleotide sequences of length l (also called l-mers). The number of such l-mers is equal to 4^l. After testing our unknown DNA sequence with the microarray, we obtain the set of overlapping l-mers from which our strand is built. This set is called the spectrum. Our aim is to reconstruct the sequence of nucleotides knowing all the l-mers from which the DNA is built. An example of such a reconstruction for l = 4 is given in figure 4. One can see that this problem is a special case of the shortest common superstring problem present in the traditional sequencing procedure. However, here we can find an efficient algorithm solving the problem.

Figure 4.Reassembling an original sequence from a set of 4-mers.

We will say that l-mers p and q overlap if the last l-1 letters of p are the same as the first l-1 letters of q. Let's assume that each obtained l-mer is represented by a node in a directed graph. If l-mers p and q overlap, we connect vertices p and q with a directed edge. Finding a shortest common superstring is equivalent to finding a path connecting all the vertices and visiting each exactly once, which is the Hamiltonian path problem. The graph corresponding to our example (presented in figure 5) is very simple, but in general the Hamiltonian path problem is NP-complete and such an approach is not used in practice. However, our graph can be easily transformed into a different graph with nodes denoted by all obtained (l-1)-mers. We make an edge from p to q if the spectrum contains an l-mer whose first l-1 nucleotides correspond to p and whose last l-1 nucleotides correspond to q. Hence, sequencing DNA is equivalent to finding a path visiting all edges. This is the Eulerian path problem, which is much simpler and can be solved by the efficient polynomial Fleury's algorithm [2]. The graph corresponding to our example is presented in figure 6.

Figure 5. DNA sequencing as the Hamiltonian path problem.

Figure 6. DNA sequencing as the Eulerian path problem.
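A sketch of this spectrum-to-sequence reconstruction; it uses Hierholzer's algorithm for the Eulerian path, which is easier to code than Fleury's algorithm cited above:

from collections import defaultdict

def sequence_from_spectrum(spectrum):
    """Reconstruct one DNA string whose l-mer spectrum is `spectrum`,
    via an Eulerian path over (l-1)-mer nodes (Hierholzer's algorithm)."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for lmer in spectrum:
        u, v = lmer[:-1], lmer[1:]      # first and last (l-1)-mer
        graph[u].append(v)
        indegree[v] += 1
    # Start where out-degree exceeds in-degree; fall back to any node (cycle).
    start = next((u for u in graph if len(graph[u]) - indegree[u] == 1),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:                        # Hierholzer: advance, retreat when stuck
        if graph[stack[-1]]:
            stack.append(graph[stack[-1]].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

print(sequence_from_spectrum(
    ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
# one valid reconstruction, e.g. ATGCGTGGCA (Eulerian paths need not be unique)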



Map Alignment

(Refer this URL for map alignment)



Large scale sequencing methods, shotgun and Sanger methods, cDNA sequencing, genome mapping

(Refer this URL for this topic)






Map Assembly


A Whole Genome Map is a high-resolution, ordered, whole genome restriction map generated from single DNA molecules extracted from bacteria, yeast, or other fungi. Whole Genome Mapping is a novel technology with unique capabilities in the field of microbiology, with specific applications in the areas of comparative genomics, strain typing, and whole genome sequence assembly. Whole Genome Maps are generated de novo, independent of sequence information, require no amplification or PCR steps, and provide a comprehensive view of whole genome architecture. A Whole Genome Map is displayed in the MapCode pattern, where the vertical lines indicate the locations of restriction sites and the distance between the lines represents the restriction fragment size.

Benefits:

With Whole Genome Mapping, you will be able to investigate microbial structure, function, diversity and genetics— without the need for amplification, PCR, cloning, paired-end libraries, pure isolates, or genomic specific reagents. Using OpGen’s unique de novo Whole Genome Mapping Technology, the Argus Whole Genome Mapping System, BGI delivers high resolution, ordered whole genome restriction maps from single microbial DNA molecules.

Applications:

Comparative Genomics
Whole Genome Sequencing Assembly
Strain Typing

Physical maps have been historically one of the cornerstones of genome sequencing and map-based cloning strategies. They also support marker assisted breeding and EST mapping. The problem of building a high quality physical map is computationally challenging due to unavoidable noise in the input fingerprint data.
A novel compartmentalized method is proposed for the assembly of high quality physical maps from fingerprinted clones. The knowledge of genetic markers enables us to group clones into clusters so that clones in the same cluster are more likely to overlap. For each cluster of clones, a local physical map is first constructed using FingerPrinted Contigs (FPC). Then, all the individual maps are carefully merged into the final physical map. Experimental results on the genomes of rice and barley demonstrate that the compartmentalized assembly produces significantly more accurate maps, and that it can detect and isolate clones that would induce "chimeric" contigs if used in the final assembly.
The software is available for download at http://www.cs.ucr.edu/~sbozdag/assembler/



A physical map is a linear ordering of a set of overlapping clones in a genomic library. Physical maps are obtained by processing the signatures or fingerprints of all the clones in a library. Fingerprints can be generated by digesting clones with one or more restriction enzymes, or by hybridizing them to a carefully designed set of DNA probes. The computational problem is to build an overlap map of the clones that is consistent with the fingerprint data.

Physical maps have been historically one of the cornerstones of genome sequencing projects. For instance, in clone-by-clone sequencing, first a physical map is constructed; then, a minimum-cardinality set of overlapping clones that spans the genomic region represented by the physical map, called the minimal tiling path (MTP), is selected. Finally, the clones in the MTP are sequenced one by one. The clone-by-clone sequencing method has been used to sequence several genomes including C. elegans, A. thaliana, H. sapiens, and O. sativa. In several recent whole-genome shotgun sequencing projects, physical maps have also been employed to validate and improve the quality of sequence assembly. This validation step has been used, for example, in the assembly of M. musculus, G. gallus, and O. anatinus.

The rapid market penetration of next-generation sequencing instruments (Roche/454, Illumina, and ABI SOLiD) is expected to bring physical mapping back to the center stage of genomics. Next-gen sequencing technologies produce massive amounts of short reads (about 200-300 bases for 454, 35 bases for Illumina and SOLiD), and therefore the de novo assembly of whole eukaryotic genomes is extremely challenging. Arguably, the only realistic approach at this time is clone-by-clone sequencing, where each clone in the MTP is sequenced using next-gen technology, and the assembly is carried out separately clone by clone.

In addition to their prominence in sequencing projects, physical maps can also provide a robust infrastructure required by many applications in genomics, such as marker assisted breeding, map-based cloning of sets of genes of interest, and EST mapping.

Physical maps can be built from data obtained by restriction digestion or hybridization experiments. In the former case, overlaps between clones are determined by a statistical method, and clones are then arranged in an order that is consistent with the restriction fingerprint data. In the latter case, clone-probe associations (i.e., which clones hybridize to which probe) are used to find an arrangement of the probes such that clones can be ordered consistently. In practice, however, hybridization experiments rarely use single probes. Due to the time and cost involved, hybridizations between probes and clones are typically carried out for a pool of probes. In this work, we assume that only clone-pool associations (hereafter called hybridization fingerprints) are available.

Nowadays almost all physical mapping projects that are based on restriction fingerprint data rely on a tool called FingerPrinted Contigs (FPC). FPC implements an algorithm called consensus band (CB) that constructs a physical map using a combination of greedy and heuristic approaches. At the core of the CB algorithm, clones are assigned to contigs based on a coincidence score, called the Sulston score, which measures the probability that two clones share a given number of restriction fragments (bands) according to a binomial probability distribution. Two bands are considered shared if their sizes are within a given tolerance value. Two clones are declared overlapping if their Sulston score is below a given cutoff threshold. For each contig, FPC builds a CB map, which is a coordinate system to which clones are aligned.
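Since the Sulston score is a binomial tail probability, it can be sketched in a few lines; the band-match probability p below is purely an illustrative placeholder (the real value depends on the tolerance setting), and this is not FPC's exact implementation:

from math import comb

def sulston_score(n, m, p=0.02):
    """P(two unrelated clones share >= m of n bands) under a binomial model,
    where p is the chance that one pair of bands matches within tolerance.
    Low scores mean the observed sharing is unlikely by chance (overlap)."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(m, n + 1))

# Clones with 30 bands each sharing 10 bands: vanishingly unlikely by chance.
print(sulston_score(30, 10))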

FPC does not attempt to resolve all the conflicts arising in the assembly of the physical map, but instead provides interactive features for manual editing. Although manual editing appears to be an unavoidable final step in any physical mapping project, this process is tedious, very time-consuming and requires significant expertise. Obviously, the better the initial quality of the physical map produced by the algorithm, the smaller the amount of manual work involved.

Comparative sequence analysis

Comparative sequence analysis for the purpose of RNA structure determination usually begins with an alignment of related sequences. In our alignment procedure, closest relatives are aligned first on the basis of primary structure similarity; each group of aligned sequences is then treated collectively and aligned against other groups. Next, sets of conserved nucleotides are identified and used for aligning in the more variable regions. Finally, where little or no primary structure similarity exists, common secondary structural elements are used as additional markers.

In our derivation of secondary structure, we distinguish between base pairs that are supported by covariances and those that are not contradicted. (A covariance is the observation that a base pair in one organism differs in both bases from the equivalent base pair in another organism.) If the two different pairs are of Watson-Crick type (G-C, A-U), we observe a compensating base change (CBC). Covariances and CBCs support the existence of a base pair because, during evolution, random single mutations that introduce an unstable pairing would not generally have been compensated for by another mutation that restored the stability unless the pairing was required. Such an observation is positive evidence, and the more CBCs the stronger the evidence. Negative evidence is a mismatch, which we define as neither a Watson-Crick pair nor a G-U pair. Notably, sequence conservation provides neither positive nor negative evidence.

For each base pair we estimate positive and negative evidence by counting the number of CBCs and mismatches. The most conserved base pair at a given alignment position is identified. Then, the remaining pairs are added as CBCs where they covary. Our guideline is to consider base pairs supported if there is at least twice as much positive evidence as negative. As a general rule, when there is less, we prefer not to include a base pair. However, when a base pair is supported in a particular phylogenetic group and disproven in other groups, we include it as specific for that group.
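A simplified sketch of this evidence counting for one candidate pair of alignment columns; real analyses count changes along a phylogenetic tree, whereas here a CBC is approximated as any Watson-Crick pair type beyond the first one observed:

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def pair_evidence(column_i, column_j):
    """Count positive evidence (compensating base changes) and negative
    evidence (mismatches) for pairing positions i and j across an alignment."""
    wc_pairs = set()
    mismatches = 0
    for x, y in zip(column_i, column_j):
        if (x, y) in WATSON_CRICK:
            wc_pairs.add((x, y))
        elif (x, y) not in WOBBLE:      # neither Watson-Crick nor G-U: mismatch
            mismatches += 1
    cbcs = max(len(wc_pairs) - 1, 0)    # e.g. seeing both G-C and A-U
    return cbcs, mismatches

def pair_supported(column_i, column_j):
    """Guideline from the text: at least twice as much positive as negative."""
    cbcs, mismatches = pair_evidence(column_i, column_j)
    return cbcs > 0 and cbcs >= 2 * mismatches

# Three sequences: G-C in two of them, A-U in the third => one CBC, no mismatch.
print(pair_evidence("GGA", "CCU"))      # (1, 0)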

Secondary Structures

Secondary structure models can be derived directly from the alignment. (RNA alignments might also be used to prove or disprove tertiary interactions, as well as interactions between more than one RNA molecule.) Typically, supported base pairs are juxtaposed and connected with a symbol (line, circle, dot, etc.) to indicate the nature of the pairing. When there is negative evidence for a pair, the bases are spaced apart with no symbol between them.


Alignment Tools for Comparative Sequence Analysis

Pasting positions
Adds x positions to the left and/or y positions to the right of a CLUSTAL multiple sequence alignment.

Inter-block gap sizes 
Calculates inter-block gap sizes for blocks in a CLUSTAL multiple alignment and checks for mismatches between aligned sequences and master sequences.

Consensus 
Calculates the consensus for the CLUSTAL or MSF multiple alignment.

CLUSTALW
Builds a multiple alignment of sequences (or sequence segments, accession numbers as entries) using CLUSTALW

(For further information, refer this URL for comparative sequence analysis)


UNIT III SEQUENCE ALIGNMENT AND DYNAMIC PROGRAMMING



Alignment-Local,Global alignment,pairwise and multiple sequence alignments.



Pairwise local/global alignment - Introduction

The EMBOSS-Align tool can be used for pairwise alignment of sequences. There are three types of alignments to consider when aligning sequences: optimal, global and local alignments.

Optimal alignments

The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments. There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry? All sequence alignment procedures make some such assumptions.

Global alignment

An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing. A tiny example:



        NLGPSTKDFGKISESREFDNQ
         |      ||||    |
        QLNQLERSFGKINMRLEDALV

Local alignment

An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:



        NLGPSTKDDFGKILGPSTKDDQ
                 ||||
        QNQLERSSNFGKINQLERSSNN
It may seem that one should always use local alignments. However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment, so global alignment is useful in some cases. You can produce a global or a local alignment with the Emboss Pairwise global and local alignment tool.


EMBOSS-Align is the tool we will use to compare 2 sequences:

The EMBOSS-Align tool contains two programs each using a different algorithm:
When you want an alignment that covers the whole length of both sequences, use the needle program.
When you are trying to find the best region of similarity between two sequences, use the water program.

The following is a slightly more detailed explanation of the two programs, needle and water, used in EMBOSS-Align:

Needle program - This program uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length.

The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment in the order of mn steps, (where 'n' and 'm' are the lengths of the two sequences). These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.
What is the optimal alignment? Dynamic programming methods ensure the optimal global alignment by exploring all possible alignments and choosing the best. It does this by reading in a scoring matrix that contains values for every possible residue or nucleotide match. Needle finds an alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix.
An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty).
Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

This is a true implementation of the Needleman-Wunsch algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you've a lot of memory and a lot of time.
Needle is for aligning two sequences over their entire length. This works best with closely related sequences. If you use needle to align very distantly-related sequences, it will produce a result but much of the alignment may have little or no biological significance.

A true Needleman Wunsch implementation like needle needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of integers and one of floats so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.
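To make the path matrix concrete, here is a minimal Needleman-Wunsch sketch in Python; it uses a simple match/mismatch score and a linear gap penalty instead of needle's substitution matrix and affine open/extend penalties:

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the full (len(a)+1) x (len(b)+1) path matrix."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                    # a prefix of a aligned against gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # substitute/match
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[n][m]                           # best end-to-end score

print(needleman_wunsch("ACGT", "ACT"))   # 1: ACGT over AC-T (3 matches, 1 gap)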


Water program - Water uses the Smith-Waterman algorithm (modified for speed enhancements) to calculate the local alignment.
The Smith-Waterman algorithm is a member of the class of algorithms that can calculate the best score and local alignment in the order of mn steps, (where 'n' and 'm' are the lengths of the two sequences). These dynamic programming algorithms were first developed for protein sequence comparison by Smith and Waterman, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.

A local alignment searches for regions of local similarity between two sequences and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.

Dynamic programming methods ensure the optimal local alignment by exploring all possible alignments and choosing the best. It does this by reading in a scoring matrix that contains values for every possible residue or nucleotide match. Water finds an alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix.

An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty).
Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

Water is a true implementation of the Smith-Waterman algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you have a lot of memory and a lot of time.

Local alignment methods only report the best matching areas between two sequences - there may be a large number of alternative local alignments that do not score as highly. If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, it may be possible to "miss" the second and subsequent alignments. You will be able to see this situation if you have done a dotplot and your local alignment does not show all the features you expected to see.
Water is for aligning the best matching subsequences of two sequences. It does not necessarily align whole sequences against each other; you should use needle if you wish to align closely related sequences along their whole lengths.
A true Smith Waterman implementation like water needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of integers and one of floating point (real) numbers so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.
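For comparison, a Smith-Waterman sketch under the same simplified scoring; it differs from the global version in only two places: every cell is floored at zero (so a poor region can be abandoned) and the answer is the best cell anywhere rather than the bottom-right corner:

def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score over all pairs of subsequences."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,                       # restart: drop bad prefixes
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])              # best cell anywhere
    return best

print(smith_waterman("TTTTACGTTTT", "GGGGACGTGGGG"))  # 4: the shared ACGT block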

Running an EMBOSS-Align alignment

We will compare the sequence for a fragment of rat myosin heavy chain 2A with a fragment of human myosin heavy chain 2A. 


The 2 sequences are shown below in FASTA format, which consists of a one-line header starting with a ">" symbol, followed by the sequence name/description. The sequence is then entered on new line(s).

Fragment of rat myosin heavy chain 2A 

>myosin heavy chain 2A, skeletal muscle - rat
kakkaitdaa mmaeelkkeq dtsahlermk knmeqtvkdl qlrldeaeql
alkggkkqiq klearvrele geveseqkrn veavkglrkh errvkeltyq
teedrknilr lqdlvdklqa kvksykrqae eaeeqsntnl skfrklqhel
eeaeeradia esqvnklrvk srevhtkvis ee

Fragment of human myosin heavy chain 2A

>myosin heavy chain 2A,skeletal muscle - human
elldaservq llhtqntsli ntkkkletdi sqmqgemedi
lqearnaeek akkaitdaam maeelkkeqd tsahlermkk
nmeqtvkdlq lrldeaeqla lkggkkqiqk learvreleg
eveseqkrna eavkglrkhe rrvkeltyqt eedrknilrl
qdlvdklqak vksykrqaee aeeqsntnla kfrklqhele
eaeeradiae sqvnklrvks revhtkvise e

Consider the following EMBOSS-Align submission form:


N.B. This tool can also be programmatically accessed as a web service. 
  • The 2 sequences were entered in FASTA format, which consists of a one-line header starting with a ">" symbol, followed by the sequence name/description. The sequence is then entered on new line(s) and terminated with a line consisting of "//".
  • 2 jobs were run, one with the EMBOSS global alignment program (needle), and one with the local alignment program (water).
  • As we are comparing 2 protein sequences, the molecule type was left on protein.
  • The default blosum62 matrix was used, and the default gap open of "10" and gap extend of "0.5" were also used.


Results of EMBOSS-Align alignments

Note that matching amino acids are connected with a "|" symbol. Mismatches would be connected with a space. A gap would be represented with a "-" symbol. Similar amino acids (e.g. leucine vs methionine) are connected via a "." symbol. Thus a sequence alignment can be represented in the format...

 
        DMFCNTEGQGIAMM
         |    ||||  ..
        TMG--NEGQGSETT
The %id is the percentage of identical matches between the two sequences over the reported aligned region.
The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater or equal to 0.0.
The Overall %id and Overall %similarity are calculated in a similar manner for the number of matches over the length of the longest of the two sequences.
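These per-region percentages are easy to recompute from an aligned pair; a small sketch of the %id calculation (the overall variants would divide by the length of the longest sequence instead):

def percent_identity(aln_a, aln_b):
    """%id over the reported aligned region: identical columns / region length."""
    columns = list(zip(aln_a, aln_b))
    identical = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * identical / len(columns)

print(percent_identity("DMFCNTEGQGIAMM",
                       "TMG--NEGQGSETT"))   # ~35.7: the 5 "|" columns out of 14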

Global Alignment Results

Matrix used Blosum62 
Gap penalties used 
Gap open 10.0 
Gap extend 0.5 

########################################
# Program:  needle
# Rundate:  Mon Jan 27 11:42:35 2003
# Align_format: srspair
# Report_file: /ebi/extserv/old-work/87881043667742.needle.res
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: myosin
# 2: myosin
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 231
# Identity:     180/231 (77.9%)
# Similarity:   181/231 (78.4%)
# Gaps:          49/231 (21.2%)
# Score: 878.0
# 
#
#=======================================
 
myosin             1                                                  k      1
                                                                      |
myosin             1 elldaservqllhtqntslintkkkletdisqmqgemedilqearnaeek     50
 
myosin             2 akkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeqla     51
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
myosin            51 akkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeqla    100
 
myosin            52 lkggkkqiqklearvrelegeveseqkrnveavkglrkherrvkeltyqt    101
                     |||||||||||||||||||||||||||||.||||||||||||||||||||
myosin           101 lkggkkqiqklearvrelegeveseqkrnaeavkglrkherrvkeltyqt    150
 
myosin           102 eedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlskfrklqhele    151
                     |||||||||||||||||||||||||||||||||||||||:||||||||||
myosin           151 eedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlakfrklqhele    200
 
myosin           152 eaeeradiaesqvnklrvksrevhtkvisee    182
                     |||||||||||||||||||||||||||||||
myosin           201 eaeeradiaesqvnklrvksrevhtkvisee    231
 
 
#---------------------------------------
#---------------------------------------
 


Local Alignment Results

Matrix used Blosum62 
Gap penalties used 
Gap open 10.0 
Gap extend 0.5 

########################################
# Program:  water
# Rundate:  Mon Jan 27 11:41:59 2003
# Align_format: srspair
# Report_file: /ebi/extserv/old-work/86951043667707.water.res
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: myosin
# 2: myosin
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 182
# Identity:     180/182 (98.9%)
# Similarity:   181/182 (99.5%)
# Gaps:           0/182 ( 0.0%)
# Score: 878.0
# 
#
#=======================================
 
myosin             1 kakkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeql     50
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
myosin            50 kakkaitdaammaeelkkeqdtsahlermkknmeqtvkdlqlrldeaeql     99
 
myosin            51 alkggkkqiqklearvrelegeveseqkrnveavkglrkherrvkeltyq    100
                     ||||||||||||||||||||||||||||||.|||||||||||||||||||
myosin           100 alkggkkqiqklearvrelegeveseqkrnaeavkglrkherrvkeltyq    149
 
myosin           101 teedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlskfrklqhel    150
                     ||||||||||||||||||||||||||||||||||||||||:|||||||||
myosin           150 teedrknilrlqdlvdklqakvksykrqaeeaeeqsntnlakfrklqhel    199
 
myosin           151 eeaeeradiaesqvnklrvksrevhtkvisee    182
                     ||||||||||||||||||||||||||||||||
myosin           200 eeaeeradiaesqvnklrvksrevhtkvisee    231
 
 
#---------------------------------------
#---------------------------------------
 

The results of the local and global alignments in this case are very similar, as the sequences are short. 







Concept of gap penalty and e-value. Alignment algorithms.


(Refer the following site for this material)




Dynamic programming in sequence alignment:Needleman-Wunsch Algorithm and Smith Waterman Algorithm,



(Refer the following site for this material)










Amino acid substitution matrices (PAM, BLOSUM).



Introduction

It is assumed that the sequences being sought have an evolutionary ancestral sequence in common with the query sequence. The best guess at the actual path of evolution is the path that requires the fewest evolutionary events. All substitutions are not equally likely and should be weighted to account for this. Insertions and deletions are less likely than substitutions and should be weighted to account for this. It is necessary to consider that the choice of search algorithm influences the sensitivity and selectivity of the search. The choice of similarity matrix determines both the pattern and the extent of substitutions in the sequences the database search is most likely to discover. 

There have been extensive studies looking at the frequencies with which amino acids substituted for each other during evolution. The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family. Each phylogenetic tree can then be examined for the substitutions found on each branch. This can then be used to produce tables (scoring matrices) of the relative frequencies with which amino acids replace each other over a short evolutionary period. Thus a substitution matrix describes the likelihood that two residue types would mutate to each other in evolutionary time.

A substitution is more likely to occur between amino acids with similar biochemical properties. For example, the hydrophobic amino acids isoleucine (I) and valine (V) get a positive score on matrices, adding weight to the likelihood that one will substitute for the other, while the hydrophobic amino acid isoleucine has a negative score with the hydrophilic amino acid cysteine (C), as the likelihood of this substitution occurring in the protein is far less. Matrices are thus used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.
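In code, a substitution matrix is simply a symmetric lookup table of residue-pair scores. A deliberately tiny sketch (the four entries follow BLOSUM62 values, but the table itself is illustrative and far from complete):

# A tiny illustrative score table; real matrices cover all 20x20 residue pairs.
SCORES = {("I", "V"): 3, ("I", "I"): 4, ("I", "C"): -1, ("L", "M"): 2}

def pair_score(x, y, matrix=SCORES):
    """Symmetric lookup: (x, y) and (y, x) score the same.
    Pairs absent from this toy table default to 0."""
    return matrix.get((x, y), matrix.get((y, x), 0))

def ungapped_score(a, b, matrix=SCORES):
    """Alignment score = sum of per-column substitution scores."""
    return sum(pair_score(x, y, matrix) for x, y in zip(a, b))

print(pair_score("V", "I"))   # 3: conservative hydrophobic substitution
print(pair_score("C", "I"))   # -1: biochemically dissimilar pair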

Guidelines for using matrices

Protein Query Length    Matrix      Open Gap    Extend Gap
>300                    BLOSUM50    -10         -2
85-300                  BLOSUM62    -7          -1
50-85                   BLOSUM80    -16         -4
>300                    PAM250      -10         -2
85-300                  PAM120      -16         -4
35-85                   MDM40       -12         -2
<=35                    MDM20       -22         -4
<=10                    MDM10       -23         -4


Importance of scoring matrices

  • Scoring matrices appear in all analyses involving sequence comparison.
  • The choice of matrix can strongly influence the outcome of the analysis.
  • Scoring matrices implicitly represent a particular theory of evolution.
  • Understanding theories underlying a given scoring matrix can aid in making proper choice.
Types of matrices


Differences between PAM and BLOSUM


  • PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the Blosum matrices are based on an implicit rather than explicit model of evolution. 
  • The alignments used to count replacements differ in sequence variability: the PAM matrices are based on mutations observed throughout a global alignment, including both highly conserved and highly mutable regions, whereas the Blosum matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps. 
  • The method used to count the replacements is different, unlike the PAM matrix, the Blosum procedure uses groups of sequences within which not all mutations are counted the same. 
Equivalent PAM and Blosum matrices

The following matrices are roughly equivalent...

  • PAM100 ==> Blosum90
  • PAM120 ==> Blosum80
  • PAM160 ==> Blosum60
  • PAM200 ==> Blosum52
  • PAM250 ==> Blosum45

Generally speaking...
  • The Blosum matrices are best for detecting local alignments.
  • The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
  • The Blosum45 matrix is the best for detecting long and weak alignments.

PAM (Point Accepted Mutation) matrix

Amino acid scoring matrices are traditionally PAM (Point Accepted Mutation) matrices, which provide various degrees of sensitivity depending on the evolutionary distance between sequence pairs. In this manner PAM40 is most sensitive for sequences 40 PAMs apart. PAM250 is for more distantly related sequences and is considered a good general matrix for protein database searching. For nucleotide sequence searching a simpler approach is used, which converts a PAM40 matrix into match/mismatch values, taking into consideration that a purine may be replaced by a purine and a pyrimidine by a pyrimidine.

e.g. The PAM 250 matrix 
This is appropriate for searching for alignments of sequence that have diverged by 250 PAMs, 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations this corresponds to sequences that are about 20 percent identical. 


 
C 12
G -3   5
P -3  -1   6
S  0   1   1   1
A -2   1   1   1   2
T -2   0   0   1   1   3
D -5   1  -1   0   0   0   4
E -5   0  -1   0   0   0   3   4
N -4   0  -1   1   0   0   2   1   2
Q -5  -1   0  -1   0  -1   2   2   1   4
H -3  -2   0  -1  -1  -1   1   1   2   3   6
K -5  -2  -1   0  -1   0   0   0   1   1   0   5
R -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y  0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
   C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W
 
     
In this example isoleucine (I) is likely to be substituted by valine (V) and gets a score of 4, while isoleucine (I) is unlikely to be substituted by cysteine (C) and gets a score of -2.

BLOSUM (Blocks Substitution Matrix)
The BLOSUM matrices, also used for protein database search scoring (the default in blastp), are divided into statistical significance degrees which, in a way, are reminiscent of PAM distances. For example, BLOSUM64 is roughly equivalent to PAM 120. BLOSUM matrices are most sensitive for local alignment of related sequences and are therefore ideal when trying to identify an unknown nucleotide sequence.
e.g. Blosum 45 Matrix 

This is derived from sequence blocks clustered at the 45% identity level.
 
 
 
G  7
P -2   9
D -1  -1   7
E -2   0   2   6
N  0  -2   2   0   6
H -2  -2   0   0   1  10
Q -2  -1   0   2   0   1   6
K -2  -1   0   1   0  -1   1   5
R -2  -2  -1   0   0   0   1   3   7
S  0  -1   0   0   1  -1   0  -1  -1   4
T -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A  0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
   G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C
 
 
 
     
In this example isoleucine (I) is likely to be substituted by valine (V) and gets a score of 3, while isoleucine (I) is unlikely to be substituted by cysteine (C) and gets a score of -3.

Summary

These 2 matrices both generally perform well, but give slightly different results. The Blosum matrices have often been the better performers, reflecting the fact that the Blosum matrices are based on the replacement patterns found in more highly conserved regions of the sequences. This seems to be an advantage as these more highly conserved regions are those discovered in database searches and they serve as anchor points in alignments involving complete sequences. It is expected that the replacements that occur in more highly conserved regions will be more restricted than those that occur in highly variable regions of the sequence. This is supported by the different pattern of positive and negative scores in the two families of matrices. These different patterns of positive and negative scores reflect different estimates of what constitute conservative and non conservative substitutions in the evolution of proteins. These differences reflect the differences in constructing the two families of matrices. Some of the difference is also likely to be because the Blosum matrices are based on much more data than the PAM matrices. The PAM matrices still perform quite well despite the small amount of data underlying them. The most likely reasons for this are the care used in constructing the alignments and phylogenetic trees used in counting replacements and the fact that they are based on a simple model of evolution and thus they still perform better than some of the more modern matrices that are less carefully constructed. 

GONNET Matrix

A different method to measure differences among amino acids was developed by Gonnet, Cohen and Benner (1992) using exhaustive pairwise alignments of the protein databases as they existed at that time. They used classical distance measures to estimate an alignment of the proteins. They then used this data to estimate a new distance matrix. This was used to refine the alignment, estimate a new distance matrix and so on iteratively. They noted that the distance matrices (all first normalised to 250 PAMs) differed depending on whether they were derived from distantly or closely homologous proteins. They suggest that for initial comparisons their resulting matrix should be used in preference to a PAM250 matrix, and that subsequent refinements should be done using a PAM matrix appropriate to the distance between proteins. 
 
 
 
 A     C    D    E      F      G       H      I      K      L      M       N      P      Q      R      S      T     V       W     Y
0.6 0.125 -0.075 0     -0.575  0.125 -0.2   -0.2   -0.1   -0.3   -0.175  -0.075  0.075 -0.05  -0.15  0.275  0.15   0.025 -0.9   -0.55   A
    2.875 -0.8  -0.75  -0.2   -0.5   -0.325 -0.275 -0.7   -0.375 -0.225  -0.45  -0.775 -0.6   -0.55  0.025 -0.125  0     -0.25  -0.125  C
           1.175 0.675 -1.125  0.025  0.1   -0.95   0.125 -1     -0.75    0.55  -0.175  0.225 -0.075 0.125  0     -0.725 -1.3   -0.7    D
                 0.9   -0.975 -0.2    0.1   -0.675  0.3   -0.7   -0.5     0.225 -0.125  0.425  0.1   0.05  -0.025 -0.475 -1.075 -0.675  E
                        1.75  -1.3   -0.025  0.25  -0.825  0.5    0.4    -0.775 -0.95  -0.65  -0.8  -0.7   -0.55   0.025  0.9    1.275  F
                               1.65  -0.35  -1.125 -0.275  -1.1   -0.875  0.1   -0.4   -0.25  -0.25  0.1   -0.275 -0.825 -1     -1      G
                                      1.5   -0.55   0.15   -0.475 -0.325  0.3   -0.275  0.3    0.15 -0.05  -0.075 -0.5   -0.2    0.55   H
                                             1     -0.525   0.7    0.625 -0.7   -0.65  -0.475 -0.6  -0.45  -0.15   0.775 -0.45  -0.175  I
                                                    0.8    -0.525 -0.35   0.2   -0.15   0.375  0.675 0.025  0.025 -0.425 -0.875 -0.525  K
                                                            1      0.7   -0.75  -0.575 -0.4   -0.55 -0.525 -0.325  0.45  -0.175  0      L
                                                                   1.075 -0.55   -0.6  -0.25 -0.425 -0.35  -0.15   0.4   -0.25  -0.05   M
                                                                          0.95   -0.225 0.175 0.075  0.225  0.125 -0.55  -0.9   -0.35   N
                                                                                  1.9  -0.05 -0.225  0.1    0.025 -0.45  -1.25  -0.775  P
                                                                                        0.675 0.375  0.05   0     -0.375 -0.675 -0.425  Q
                                                                                              1.175 -0.05  -0.05  -0.5   -0.4   -0.45   R
                                                                                                     0.55   0.375 -0.25  -0.825 -0.475  S
                                                                                                            0.625  0     -0.875 -0.475  T
                                                                                                                   0.85  -0.65  -0.275  V
                                                                                                                          3.55   1.025  W
                                                                                                                                 1.95   Y
 
 
 
     

DNA Identity Matrix (Unitary Matrix)
Here you only get a positive score for a match and a score of -10000 for a mismatch. Because such a high penalty is given for a mismatch, no substitution will be accepted, although a gap may be permitted.

     A          T          G          C
A    1
T   -10000      1
G   -10000     -10000      1
C   -10000     -10000     -10000      1




    
PAM matrices


PAM matrices are amino acid substitution matrices that encode the expected evolutionary change at the amino acid level. Each PAM matrix is designed to compare two sequences that are a specific number of PAM units apart. For example, the PAM120 score matrix is designed to compare sequences that are 120 PAM units apart: the score it gives a pair of sequences is the (log of the) probability of such sequences evolving during 120 PAM units of evolution. For any specific pair (Ai, Aj) of amino acids, the (i,j) entry in the PAM n matrix reflects the frequency at which Ai is expected to be replaced by Aj in two sequences that are n PAM units diverged. These frequencies are estimated by gathering statistics on replaced amino acids.

Collecting statistics about amino acid substitutions in order to compute the PAM matrices is relatively difficult for sequences that are distantly diverged, as mentioned in the previous section. But for sequences that are highly similar, i.e., whose PAM divergence distance is small, finding the position correspondence is relatively easy, since only a few insertions and deletions took place. Therefore, in the first stage statistics were collected from aligned sequences believed to be approximately one PAM unit diverged, and the PAM1 matrix was computed from these data, as follows: let Mij denote the observed frequency (= estimated probability) of amino acid Ai mutating into amino acid Aj during one PAM unit of evolutionary change. M is a 20x20 real matrix, with the values in each column adding up to 1. There is a significant variance between the values in each column.

  
[Figure: the top-left 5x5 corner of the PAM1 matrix; for convenience the entries are shown as 10^4 x Mij.]


Once M is known, the matrix M^n gives the probabilities of any amino acid mutating to any other during n PAM units. The (i,j) entry in the PAM n matrix is therefore:

$\mathrm{PAM}_n(i,j) = \log \frac{f(i)\,(M^n)_{ij}}{f(i)\,f(j)} = \log \frac{(M^n)_{ij}}{f(j)}$

where f(i) and f(j) are the observed frequencies of amino acids Ai and Aj respectively; the numerator is the probability of observing the aligned pair in sequences that are n PAM units diverged, and the denominator is the probability of observing it by chance. This approach assumes that the frequencies of the amino acids remain constant over time, and that the mutational processes causing substitutions during an interval of one PAM unit operate in the same manner over longer periods. We take the log of the probability ratio so that the total score of all substitutions can be computed by summation rather than multiplication. The PAM matrix is usually organized by dividing the amino acids into groups of relatively similar amino acids, with all group members located in consecutive columns of the matrix.
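
The construction just described is easy to reproduce in miniature. The Python sketch below uses a made-up two-letter "alphabet" and invented frequencies purely to show the mechanics (a matrix power followed by log-odds); it is not a real PAM computation:

import numpy as np

# Toy mutation matrix for a 2-letter alphabet (invented numbers): each
# column sums to 1, mirroring the description of M above.
M = np.array([[0.99, 0.02],
              [0.01, 0.98]])
f = np.array([0.6, 0.4])   # invented background frequencies

def pam(n):
    # PAM_n(i, j) = log((M^n)_ij / f(j)), as in the formula above.
    Mn = np.linalg.matrix_power(M, n)
    return np.log(Mn / f)  # broadcasting divides entry (i, j) by f[j]

print(pam(1))
print(pam(120))   # scores for sequences about 120 PAM units apart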

BLOSUM - BLOcks SUbstitution Matrix

The BLOSUM matrix is another amino acid substitution matrix, first calculated by Henikoff and Henikoff [5]. For its calculation, only blocks of amino acid sequences with little change between them are considered. These blocks are called conserved blocks (see figure 3.2). One reason for this is that one needs to find a multiple alignment between all these sequences, and it is easier to construct such an alignment from more similar sequences. Another reason is that the purpose of the matrix is to measure the probability of one amino acid changing into another, and the change between distant sequences may also include insertions and deletions of amino acids. Moreover, we are more interested in the conservation of regions inside protein families, where the sequences are quite similar, and we therefore restrict our examination to such regions.

  
[Figure 3.2: Alignment of several sequences, with the conserved blocks marked.]


The first stage of building the BLOSUM matrix is eliminating sequences that are identical in more than x% of their amino acid sequence. This is done to avoid biasing the result in favor of a certain protein. The elimination is done either by removing sequences from the block, or by finding a cluster of similar sequences and replacing it with a new sequence that represents the cluster. The matrix built from blocks with no more than x% similarity is called BLOSUM-x (e.g., the matrix built using sequences with no more than 50% similarity is called BLOSUM-50).
The second stage is counting the pairs of amino acids in each column of the multiple alignment. For example, in a column containing the acids AABACA (as in the first column of the block in figure 3.2), there are 6 AA pairs, 4 AB pairs, 4 AC pairs, and one BC pair. The probability $q_{i,j}$ of a pair of amino acids in the same column being Ai and Aj is calculated, as well as the probability $p_i$ of a certain amino acid being Ai.
In the third stage the log-odds ratio $s_{i,j} = \log_2 \frac{q_{i,j}}{p_i p_j}$ is calculated. As the final result we take the rounded value of $2s_{i,j}$; this value is stored in the (i,j) entry of the BLOSUM-x matrix.
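
The stages above fit in a few lines of Python. The sketch below counts pairs in the single column AABACA used as the example and computes one log-odds entry; a real BLOSUM computation pools counts over many columns and blocks, and derives the single-residue probabilities from the pair counts rather than, as here, directly from the column:

import math
from collections import Counter
from itertools import combinations

column = "AABACA"   # the example column discussed above

# Stage 2: count unordered residue pairs within the column.
pair_counts = Counter(tuple(sorted(p)) for p in combinations(column, 2))
total_pairs = sum(pair_counts.values())            # C(6, 2) = 15

# Simplified single-residue probabilities p_i, taken from the column itself.
p = {a: c / len(column) for a, c in Counter(column).items()}

# Stage 3: s_ij = log2(q_ij / expected), stored as the rounded 2*s_ij.
def blosum_entry(a, b):
    q = pair_counts[tuple(sorted((a, b)))] / total_pairs
    expected = p[a] * p[b] * (1 if a == b else 2)  # unordered-pair probability
    return round(2 * math.log2(q / expected))

print(pair_counts)            # AA: 6, AB: 4, AC: 4, BC: 1, as in the text
print(blosum_entry('A', 'B'))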
In contrast to the PAM matrices, many more sequences are examined in the process of computing the BLOSUM matrix. Moreover, the sequences exhibit a specific kind of resemblance, and therefore the two sets of matrices differ.
The efficiency of two matrices is compared by calculating the ratio between the number of pairs of similar sequences discovered by one matrix but missed by the other, and the number of pairs missed by the first but found by the other. By this comparison, BLOSUM-62 is found to be better than the other BLOSUM-x matrices, as well as better than the PAM-x matrices.






Sequence similarity search with databases: BLAST and FASTA


The statistics of global sequence comparison

Unfortunately, under even the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores. Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these can not be generalized easily. Therefore, one of the few methods available for assessing the statistical significance of a particular global alignment is to generate many random sequence pairs of the appropriate length and composition, and calculate the optimal alignment score for each. While it is then possible to express the score of interest in terms of standard deviations from the mean, it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown. The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01. One further pitfall to avoid is exaggerating the significance of a result found among multiple tests. When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly. An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.

The statistics of local sequence comparison

Fortunately statistics for the scores of local alignments, unlike those of global alignments, are well understood. This is particularly true for local alignments lacking gaps, which we will consider first. Such alignments were precisely those sought by the original BLAST database search programs.


A local alignment without gaps consists simply of a pair of equal length segments, one from each of the two sequences being compared. A modification of the Smith-Waterman or Sellers algorithms will find all segment pairs whose scores can not be improved by extension or trimming. These are called high-scoring segment pairs or HSPs.
To analyze how high a score is likely to arise by chance, a model of random sequences is needed. For proteins, the simplest model chooses the amino acid residues in a sequence independently, with specific background probabilities for the various residues. Additionally, the expected score for aligning a random pair of amino acids is required to be negative. Were this not the case, long alignments would tend to have high scores independently of whether the aligned segments were related, and the statistical theory would break down.
Just as the sum of a large number of independent identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution. (We will elide the many technical points required to make this statement rigorous.) In studying optimal local sequence alignments, we are essentially dealing with the latter case. In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda. Most simply, the expected number of HSPs with score at least S is given by the formula

$E = K m n\, e^{-\lambda S} \qquad (1)$

We call this the E-value for the score S.

This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score. The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.
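
A quick numerical check of this intuition, with made-up values of K and lambda (they are not taken from any real scoring system):

import math

def expected_hsps(K, lam, m, n, S):
    # E-value: E = K * m * n * exp(-lambda * S), equation (1) above.
    return K * m * n * math.exp(-lam * S)

K, lam = 0.1, 0.3                              # invented parameters
print(expected_hsps(K, lam, 250, 10000, 40))   # baseline
print(expected_hsps(K, lam, 500, 10000, 40))   # doubling m doubles E
print(expected_hsps(K, lam, 250, 10000, 50))   # higher score: E drops exponentially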

Bit scores

Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years. By normalizing a raw score using the formula

$S' = \frac{\lambda S - \ln K}{\ln 2}$

one attains a "bit score" S', which has a standard set of units. The E-value corresponding to a given bit score is simply

$E = m n\, 2^{-S'}$

Bit scores subsume the statistical essence of the scoring system employed, so that to calculate significance one needs to know in addition only the size of the search space.

P-values

The number of random HSPs with score >= S is described by a Poisson distribution. This means that the probability of finding exactly a HSPs with score >= S is given by

$P(a) = e^{-E}\,\frac{E^{a}}{a!}$

where E is the E-value of S given by equation (1) above. Specifically, the chance of finding zero HSPs with score >= S is $e^{-E}$, so the probability of finding at least one such HSP is

$P = 1 - e^{-E}$

This is the P-value associated with the score S. For example, if one expects to find three HSPs with score >= S, the probability of finding at least one is 0.95. The BLAST programs report E-values rather than P-values because it is easier to appreciate the difference between, for example, E-values of 5 and 10 than between P-values of 0.993 and 0.99995. However, when E < 0.01, P-values and E-values are nearly identical.
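
The conversions above are easy to chain together. This sketch reuses the same invented K and lambda as before; only the formulas themselves come from the text:

import math

def bit_score(S, K, lam):
    # S' = (lambda * S - ln K) / ln 2
    return (lam * S - math.log(K)) / math.log(2)

def evalue_from_bits(S_prime, m, n):
    # E = m * n * 2**(-S')
    return m * n * 2.0 ** (-S_prime)

def p_value(E):
    # Probability of at least one HSP under the Poisson model.
    return 1.0 - math.exp(-E)

K, lam, m, n = 0.1, 0.3, 250, 10000   # invented values, as before
Sp = bit_score(60, K, lam)
E = evalue_from_bits(Sp, m, n)
print(Sp, E, p_value(E))
print(p_value(3.0))                   # three expected HSPs -> P is about 0.95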

Database searches

The E-value of equation (1) applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths? One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database. Recent versions of the FASTA protein comparison programs take this approach.
An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues. Examining equation (1), this can be accomplished simply by treating the database as a single long sequence of length N. The BLAST programs take this approach to calculating database E-value. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.

The statistics of gapped alignments

The statistics developed above have a solid theoretical foundation only for local alignments that are not permitted to have gaps. However, many computational experiments and some analytic results strongly suggest that the same theory applies as well to gapped alignments. For ungapped alignments, the statistical parameters can be calculated, using analytic formulas, from the substitution scores and the background residue frequencies of the sequences being compared. For gapped alignments, these parameters must be estimated from a large-scale comparison of "random" sequences.


Some database search programs, such as FASTA or various implementations of the Smith-Waterman algorithm, produce optimal local alignment scores for the comparison of the query sequence to every sequence in the database. Most of these scores involve unrelated sequences, and therefore can be used to estimate lambda and K. This approach avoids the artificiality of a random sequence model by employing real sequences, with their attendant internal structure and correlations, but it must face the problem of excluding from the estimation scores from pairs of related sequences. The BLAST programs achieve much of their speed by avoiding the calculation of optimal alignment scores for all but a handful of unrelated sequences. They must therefore rely upon a pre-estimation of the parameters lambda and K for a selected set of substitution matrices and gap costs. This estimation could be done using real sequences, but has instead relied upon a random sequence model, which appears to yield fairly accurate results.

Edge effects

The statistics described above tend to be somewhat conservative for short sequences. The theory supporting these statistics is an asymptotic one, which assumes an optimal local alignment can begin with any aligned pair of residues. However, a high-scoring alignment must have some length, and therefore can not begin near the end of either of the two sequences being compared. This "edge effect" may be corrected for by calculating an "effective length" for sequences; the BLAST programs implement such a correction. For sequences longer than about 200 residues the edge effect correction is usually negligible.

The content of this section is adapted from http://www.psc.edu/biomed/dist-ed/ and http://www.ncbi.nlm.nih.gov/BLAST/tutorial/


BLAST

BLAST: A simplification of Smith-Waterman

The BLAST algorithm (20) uses a word-based heuristic similar to that of FASTA to approximate a simplification of the Smith-Waterman algorithm known as the maximal segment pairs algorithm. Maximal segment pair alignments do not allow gaps and are specified by an equation that includes only the first and fourth terms of the Smith-Waterman equation presented above. Maximal segment pair alignments have the very valuable property that their statistics are well understood (15). Thus, we can readily compute a significance probability for a maximal segment pair alignment. Recent advances in maximal segment pair statistics allow several independent segment alignments to be used in evaluating the significance probability.
The price for being able to readily compute a significance probability is that the alignments can not have gaps (15). Thus, the evolutionary model requires that there be a fairly long stretch of sequence that has evolved without insertions or deletions, or at least with a complementary pattern of insertions and deletions that does not significantly disrupt the alignment.


The BLAST algorithm is less sensitive than Smith-Waterman and, in appropriate circumstances, more selective. For proteins the BLAST word-based heuristic is more sensitive than the FASTA heuristic, even though BLAST uses a word size of three for proteins while FASTA uses a word size of two. However, BLAST uses a very long word size, eleven, for nucleic acid sequences. Also, the modifications to the heuristic that improve the sensitivity for protein sequences do not work as well for nucleic acid sequences, because nucleic acids have only a four-letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides. Thus, FASTA is more sensitive than BLAST for nucleic acid sequences and should be used instead of BLAST.
The BLAST word-based heuristic uses a default word size of three for proteins and eleven for nucleic acid sequences. The tables below illustrate how the BLAST heuristic differs from the FASTA heuristic, using a word size of two applied to a short protein query sequence. A word size of two is used in the example to keep it to a manageable size.


Creating a Word List for BLAST, Word Size = 2
Adipokinetic hormone II - Migratory locust
 q l n f s a g w


 q l
   l n
     n f
       f s
         s a
           a g
             g w
In the BLAST heuristic this list of words is expanded in order to recover the sensitivity lost by matching only identical words. Any word that scores at least a minimum threshold when aligned with any of the initial list of words is added to the list. BLAST then examines the database sequences for words that exactly match any of the expanded list. The list below expands the initial 7 words to a total of 47: it contains every word that scores at least eight when aligned with an initial word and scored with the PAM 120 similarity table.
Initial
  Word    Expanded List

    ql:   ql, qm, hl, zl
    ln:   ln, lb
    nf:   nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf
    fs:   fs, fa, fn, fd, fg, fp, ft, fb, ys
    sa:   no words score 8 or more, including the initial word sa
    ag:   ag
    gw:   gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw, tw, vw, bw, zw, xw
One noteworthy feature of the table above is that there is no word that scores eight or more when aligned with the initial word "sa". This situation does occur in actual BLAST searches. The user has the option to force the initial word into the final list. The default is not to include such low scoring words because they contain so little information that they are unlikely to be critical in finding the maximal segment pair alignment. In practice, BLAST has used a word length of 3 for protein database searches with a required threshold score of 13 using the Blosum62 similarity scoring matrix.
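
The word-expansion step itself is mechanical. The sketch below uses a crude made-up scoring function in place of the PAM 120 table, so its output will not reproduce the 47-word list above, but the procedure is the same:

from itertools import product

ALPHABET = "acdefghiklmnpqrstvwy"

def pair_score(a, b):
    # Stand-in for a real similarity table such as PAM 120 (invented).
    return 5 if a == b else -1

def word_score(w1, w2):
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def neighborhood(word, threshold):
    # All same-length words scoring >= threshold against `word`; this is
    # the expanded list BLAST scans the database for.
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        cand = "".join(cand)
        if word_score(word, cand) >= threshold:
            hits.append(cand)
    return hits

query = "qlnfsagw"
w = 2
for i in range(len(query) - w + 1):
    word = query[i:i + w]
    print(word, neighborhood(word, threshold=8))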

Recent Developments in BLAST: More Sensitivity and Gaps

The inventors of the BLAST algorithm have continued to look for ways to improve both the sensitivity and the speed of their algorithm (23). As we will see later, the growth of the sequence databases has raised the minimum score, and hence the length of alignment, that must be found by BLAST for a match to be significant. For these longer alignments both the speed and the sensitivity of BLAST can be improved by requiring that the algorithm find two matches above some (lower) threshold rather than one match. Both matches must be on the same diagonal. The increase in speed results from improved specificity, so that fewer sequences are completely evaluated. In practice, BLAST now looks for two words of length 3 that each score at least 11 using the Blosum62 similarity scoring matrix. The two matches must be within 40 amino acids of each other on the same diagonal.
Another recent improvement in the BLAST algorithm is to provide a rigorous gapped alignment as part of the results. The first step that BLAST made towards providing gapped alignments was to find a number of high scoring segments in addition to the maximal scoring segment. A later version used a strategy similar to that found in FASTA. In this strategy the maximal scoring segment is used to define a band in the path graph, and the Smith-Waterman (17) algorithm is used to find a gapped alignment that lies entirely within the band. However, as noted in the discussion of FASTA, this strategy can produce a less than optimal Smith-Waterman alignment if the number of gaps needed in the best gapped alignment is greater than the width of the band.


The new gapped BLAST gets around this problem, and still avoids the high computational cost of a full Smith-Waterman alignment on the pair of sequences, by building the alignment out from a central high scoring pair of aligned amino acids, in a way analogous to how BLAST extends the initial maximal segment pair alignment. The initial pair of aligned amino acids is chosen as the middle pair of the highest scoring window of 11 amino acids in the high scoring segment pair alignments. The Smith-Waterman alignment is then extended in all directions in the path graph until the score falls below a fixed percentage of the highest score yet computed in the Smith-Waterman phase.


This method of computing the Smith-Waterman gapped alignment will always find the best scoring Smith-Waterman alignment if two conditions are met. First, the calculation must be extended until a lower bound score of zero is reached; using a higher threshold for stopping the calculation accepts a very small risk of not finding the complete alignment in return for a very large savings in computer resources. The second condition is that the initial pair of amino acids selected as the midpoint from which to extend the alignment must actually be part of the alignment that would be reported as the best alignment by a complete Smith-Waterman alignment of the pair of sequences. In practice, the selection criteria used by gapped BLAST appear to be very effective in selecting an appropriate pair of amino acids from which to extend the alignment.
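
The "extend until the score drops" idea is easiest to see in an ungapped setting. The sketch below is a simplified, hypothetical analogue of the extension described above: it walks rightward along one diagonal from a seed pair and stops once the running score falls a fixed amount X below the best score seen so far:

def extend_right(s1, s2, i, j, score, X=10):
    # Extend an ungapped alignment rightward from the seed pair (i, j),
    # stopping when the running score drops more than X below the best.
    best = running = 0
    best_len = 0
    k = 1
    while i + k < len(s1) and j + k < len(s2):
        running += score(s1[i + k], s2[j + k])
        if running > best:
            best, best_len = running, k
        if best - running > X:   # score-drop termination
            break
        k += 1
    return best, best_len        # score gained and extension length

match = lambda a, b: 2 if a == b else -3   # trivial match/mismatch scoring
print(extend_right("gctggaaggcat", "gcagagcact", 0, 0, match))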

FASTA: A fast approximation to Smith-Waterman

FASTA is a two-step algorithm (19). The first step is a search for highly similar segments in the two sequences. In this search, words of a specific size are used to find regions in a two-dimensional table similar to that shown for the Smith-Waterman algorithm. These regions are a diagonal, or a few closely spaced diagonals, in the table that have a high number of identical word matches between the sequences. The second step is a Smith-Waterman alignment centered on the diagonals that correspond to the alignment of the highly similar sequence segments. The region for the Smith-Waterman alignment is bounded by a window size. The window size limits the number of insertions or deletions one sequence can accumulate with respect to the other in the alignment. Thus, the significant speedup observed in a FASTA search relative to a full Smith-Waterman search is due to this prior restriction of the alignment space, as well as to skipping the Smith-Waterman step when no diagonals are found that correspond to alignments between highly similar sequence segments.
The FASTA algorithm is a heuristic approximation to the Smith-Waterman algorithm. The heuristics used by FASTA allow it to run much faster than the Smith-Waterman algorithm, but at the cost of some sensitivity. Two heuristics are employed. Both can be interpreted as restrictions on the model of sequence evolution that is used in comparing the sequences. The first restriction is implemented by the word size parameter, usually two for proteins and six for nucleic acids. This means that FASTA constrains the evolution between a pair of sequences to preserve a number of unchanged dipeptides or hexanucleotides. The effect of this word size restriction can be seen in the dot plots shown below.
In the dot plots below, each asterisk marks the beginning of a string of matches between the two short nucleotide sequences shown at the edges of the dot plot. The length of the matches thus varies between one and three nucleotides. As can be seen, with a word size of one the dot plot is so noisy that it is difficult to pick out any region that might correspond to an alignment that accurately reflects homology between the sequences. However, when we shift to a word size of two, the region more or less in the center of the dot plot stands out as offering the most possibilities for a long alignment with a high degree of sequence identity. In cases like this the loss of sensitivity resulting from increasing the word size also yields a worthwhile increase in selectivity. In the final dot plot, with a word size of three, not only has most of the noise been removed but so has most of the information indicating the regions involved in the best alignment between the two sequences. Given the same scoring parameters, the best alignment should be the same alignment found by the Smith-Waterman algorithm and shown above.
Dot Plot: Word Size = 1

    g  c  t  g  g  a  a  g  g  c  a  t

g   *        *  *        *  *
c      *                       *
a                  *  *           *
g   *        *  *        *  *
a                  *  *           *
g   *        *  *        *  *
c      *                       *
a                  *  *           *
c      *                       *
t         *                          *
Dot Plot: Word Size = 2

    g  c  t  g  g  a  a  g  g  c  a  t

g   *                       *
c                              *
a                     *
g               *
a                     *
g   *                       *
c                              *
a
c      *
t
Dot Plot: Word Size = 3

    g  c  t  g  g  a  a  g  g  c  a  t

g                           *
c
a
g
a
g                           *
c
a
c
t
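
All three dot plots above can be generated mechanically. Here is a small sketch (plain-text output, helper names invented) that prints an asterisk wherever an exact word match of the chosen size begins:

def dot_plot(cols, rows, word_size):
    # Mark the start of every exact match of length `word_size`.
    print("    " + "  ".join(cols))
    for i in range(len(rows)):
        line = [" "] * len(cols)
        for j in range(len(cols) - word_size + 1):
            if (i + word_size <= len(rows)
                    and rows[i:i + word_size] == cols[j:j + word_size]):
                line[j] = "*"
        print(rows[i] + "   " + "  ".join(line))

cols = "gctggaaggcat"
rows = "gcagagcact"       # the row sequence of the plots above
for w in (1, 2, 3):
    print("Word size =", w)
    dot_plot(cols, rows, w)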
The first step in the FASTA algorithm is to divide the query sequence into its constituent overlapping words of length two for proteins or six for nucleic acids. This process is shown in the table below. Then, as each sequence is read from the database, it is also divided into its constituent overlapping words. These two lists of words are compared to find the identical words in both sequences. An initial score is computed based on the number of identities concentrated within small regions of the dot plot. If this initial score is high enough, a second score is computed by evaluating which of the initial identities can be joined into a consistent alignment using only gaps of less than some maximum length. This maximum gap length is set by the window size parameter. Finally, if the secondary score is high enough, a Smith-Waterman alignment is performed within the region of the dot plot defined by the concentrated identities and the window size. This third score is reported as the optimal score.
Creating a Word List for FASTA, Word Size = 6

    g  c  t  g  g  a  a  g  g  c  a  t

    g  c  t  g  g  a
       c  t  g  g  a  a
          t  g  g  a  a  g
             g  g  a  a  g  g
                g  a  a  g  g  c
                   a  a  g  g  c  a
                      a  g  g  c  a  t
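
Building on the word list above, FASTA's first step can be sketched as a diagonal count: word matches that lie on the same diagonal (query position minus database position) are accumulated, and the diagonals with high counts are the ones handed to the banded Smith-Waterman stage. The names here are illustrative:

from collections import defaultdict

def word_positions(seq, w):
    # Map each word of length w to the positions where it occurs.
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i)
    return table

def diagonal_counts(query, subject, w):
    # Count word matches per diagonal (i - j).
    table = word_positions(subject, w)
    diags = defaultdict(int)
    for i in range(len(query) - w + 1):
        for j in table.get(query[i:i + w], ()):
            diags[i - j] += 1
    return dict(diags)

# Word size 2 keeps the short example sequences above usable (FASTA would
# use 6 for nucleic acids).
print(diagonal_counts("gctggaaggcat", "gcagagcact", w=2))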
The weaknesses in the FASTA approach can be shown with two pathological examples. The first example involves two proteins that share 50% identity, but where the proper alignment consists of alternating matches and mismatches. With a word size of two, there would be no word matches along the main diagonal of the dot plot for the sequences (although there will potentially be random or spurious word matches on the off-diagonals), and the proper alignment would probably not be found. The second case consists of two proteins that are almost identical, except that the second protein has a 20-residue insertion in the middle of the sequence. If the window size is 15, then the Smith-Waterman alignment phase of FASTA will not have enough alignment space to jump the 20-residue insertion. Thus, the resulting alignment will cover only the sequence before or after the insertion (whichever had the higher diagonal scores), and the fact that the proteins were basically identical (with only one long insertion) will be missed.


The window size discussed in the previous paragraph is the second heuristic used by FASTA. Its effect is more variable than that of the word size. If the best alignment lies entirely within the window defined by the window size and the concentrated identities, there is no effect. However, if the best alignment, as defined by a full Smith-Waterman analysis, goes outside the window, then a lower scoring alignment will be found by FASTA. This can lead the user to conclude that the sequences are not homologous when in fact they are, and homology could have been inferred from a full Smith-Waterman alignment.
In practice these pathological cases are very unlikely. However, similar cases do occur and the loss of sensitivity caused by the use of these heuristics will be seen in the examples presented at the end of the tutorial.