A Text Book On BIOINFORMATICS -BY ZAHOORULLAH S MD: UNIT II SPECIAL TOPICS IN BIOINFORMATICS

Special topics in Bioinformatics

DNA mapping and sequencing

DNA restriction mapping is often the first operation being done to characterize some unknown DNA. There are some special enzymes called restriction enzymes which cut DNA molecule into pieces at every occurrence of a specified nucleotide sequence. Places where molecule is cut are called restriction sites and pieces of DNA created as a result are called restriction fragments. The length of these fragments can be easily measured by gel electrophoresis. The problem of DNA restriction mapping is to reconstruct positions of the restriction sites having lengths of all restriction fragments. This corresponds to the situation that cut is performed with a probability lower than 1. Having billions of DNA molecules it is very likely to obtain all pairwise distances. This kind of problem is called partial digest problem.

Figure 1. Partial digest problem [3].

Let X be a set of n points on the axis.ΔX is a multiset containing all the pairwise distances between points from X. Having a multiset L of pairwise distances we need to reconstruct X such that ΔX=L. One can see that this problem is non-unique. Moving set X along axis or reflecting it would not change pairwise distances. For simplicity we will assume that the molecule starts at position 0 and restriction sites are situated on the positive half-axis so the n'th point from X corresponds to the biggest number M contained in L. However, this will not overrun non-uniqueness problem entirely.

The simplest approach to solve a complete digest problem is an exhaustive search. The idea is to generate all possible n-element sets {0,x₂,...,x_n_-1,M} such that x_ibelongs to L and 0 <x_2,...< x_n_-1<M. Then we treat each set as X and calculate ΔX. If ΔX=L, a solution has been found. The main advantage of exhaustive search approach is its simplicity. However, complexity of the algorithm is definitely too big to use it in practice. This is why some different algorithm using branch and bound approach has been invented. The idea is as follows:

Make X ={0,M}, remove number M from L.
Find the next biggest number α in L. We know that restriction site can be either at position y =α (leftmost) or y = M -α (rightmost). Assume that restriction site is at the leftmost position .
Calculate distances between y and all elements of X. This will be noted as Δ(y,X).
If L contains Δ(y,X) then L = L -Δ(y,X) and go to the step 2 (or output a solution if L is empty). Otherwise there are two possibilities:

A. If only a leftmost position has been already checked assume y = M -α and go to the step 3.

B. If both leftmost and rightmost positions have been already checked, backtrack the last iteration and try again.

Time complexity of this algorithm in a worst case is exponential. However, in average case algorithm performs pretty well and it is very often used in practical problems.

Another problem strictly related to DNA is sequencing which consists in determining the sequence of nucleotides forming a DNA strand. First method of DNA sequencing called chain termination method [1] was invented by Sanger and worked well on sequences containing at most few hundreds of nucleotides. To sequence larger sample of DNA, one can use a method called shotgun. Long DNA strand is broken randomly into smaller pieces and sequenced with Sanger's method. This procedure is repeated multiple times so many overlapping fragments (called reads) are obtained. One can see that those reads are substrings of a longer genomic string. Hence, DNA sequencing can be brought to a shortest common superstring problem. The main disadvantage of such approach is that shortest common superstring problem is NP-complete so some greedy heuristics must be used to solve a problem in an efficient way [6].

Figure 2. Shotgun DNA sequencing method [4].

Rapid development of DNA microarrays resulted in a new sequencing technique called sequencing by hybridization. A DNA microarray contains short DNA fragments called probes. These probes can hybridize with a DNA strand being sequenced (for example ACT probe will hybridize with complementary TGA sequence). Hence, we can detect a presence of some sequence. Structure of the example microarray can be seen on the figure 3.

Figure 3. Example microarray made by Affymetrix [5].

Let's assume that microarray contains all possible nucleotide sequences of a length l (also called l-mers). Number of such l-mers is equal to 4^l. After testing our unknown DNA sequence with microarray we will obtain set of overlapping l-mers from which our strand is built. This set is called spectrum. Our aim is to reconstruct sequence of nucleotides knowing all the l-mers from which DNA is built. An example of such reconstruction for l = 4 is given on the figure 4. One can see that this problem is a special case of a shortest common superstring problem present in a traditional sequencing procedure. However, here we can find an efficient algorithm solving the problem

Figure 4.Reassembling an original sequence from a set of 4-mers.

We will say that l-mers p and q overlap if l-1 last letters of p is the same as first l-1 letters of q. Let's assume that each obtained l-mer is represented by a node in a directed graph. If l-mers p and q overlap, we connect vertices p and q with a directed edge. Finding a shortest common superstring is equivalent to finding a path connecting all the vertices visiting each exactly once which is a Hamiltonian path problem. Graph corresponding to our example (presented on the figure 5.) is very simple but in general Hamiltonian path problem is NP-complete and such approach is not used in practice. However, our graph can be easily transformed to some different graph with nodes denoted by all obtained (l-1)-mers. We make and edge from p to q if a spectrum contains an l-mer which first l-1 nucleotides corresponds to p and last l-1 corresponds to q. Hence, sequencing DNA is equivalent to finding a path visiting all edges. This is Eulerian path problem which is much simpler and can be solved by efficient polynomial Fleury’s algorithm [2]. The graph corresponding to our example is presented on the figure 6.

Figure 5. DNA sequencing as the Hamiltonian path problem.

Figure 6. DNA sequencing as the Eulerian path problem.

Map Alignment

(Refer this URL for map alignment)

http://schatzlab.cshl.edu/teaching/AssemblyClass/06.%20Whole%20Genome%20Alignment.pdf

Large scale sequencing methods ,shortgun and sanger method, cDNA sequencing Genome mapping

(Refer this URL for map alignment)

http://universe-review.ca/R11-16-DNAsequencing.htm

Map Assembly

A Whole Genome Map is a high-resolution, ordered, whole genome restriction map generated from single DNA molecules extracted from bacteria, yeast, or other fungi. Whole Genome Mapping is a novel technology with unique capabilities in the field of microbiology, with specific applications in the areas of Comparative Genomics, Strain Typing, and Whole Genome Sequence Assembly. Whole Genome Maps are generated de novo, independent of sequence information, require no amplification or PCR steps, and provide a comprehensive view of whole genome architecture. A Whole Genome Map is displayed in the MapCode pattern where the vertical lines indicate the locations of restriction sites, and the distance between the lines represent the restriction fragment size.

Benefits:

With Whole Genome Mapping, you will be able to investigate microbial structure, function, diversity and genetics— without the need for amplification, PCR, cloning, paired-end libraries, pure isolates, or genomic specific reagents. Using OpGen’s unique de novo Whole Genome Mapping Technology, the Argus Whole Genome Mapping System, BGI delivers high resolution, ordered whole genome restriction maps from single microbial DNA molecules.

Applications:

Comparative Genomics

Whole Genome Sequencing Assembly

Strain Typing

Physical maps have been historically one of the cornerstones of genome sequencing and map-based cloning strategies. They also support marker assisted breeding and EST mapping. The problem of building a high quality physical map is computationally challenging due to unavoidable noise in the input fingerprint data.

A novel compartmentalized method is proposed for the assembly of high quality physical maps from fingerprinted clones. The knowledge of genetic markers enables us to group clones into clusters so that clones in the same cluster are more likely to overlap. For each cluster of clones, a local physical map is first constructed using FingerPrinted Contigs (FPC). Then, all the individual maps are carefully merged into the final physical map. Experimental results on the genomes of rice and barley demonstrate that the compartmentalized assembly produces significantly more accurate maps, and that it can detect and isolate clones that would induce "chimeric" contigs if used in the final assembly.

The software is available for download at http://www.cs.ucr.edu/~sbozdag/assembler/

A physical map is a linear ordering of a set of overlapping clones in a genomic library. Physical maps are obtained from processing the signatures or fingerprints of all the clones in a library. Fingerprints can be generated by digesting clones with one or more restriction enzymes, or by hybridizing them to a carefully designed set of DNA probes. The computational problem is to build an overlap map of the clones that is consisted with the fingerprint data.

Physical maps have been historically one of the cornerstones of genome sequencing projects. For instance, in clone-by-clone sequencing, first a physical map is constructed; then, a minimum-cardinality set of overlapping clones that spans the genomic region represented by the physical map, called minimal tiling path (MTP), is selected. Finally, the clones in the MTP are sequenced one by one . The clone-by-clone sequencing method has been used to sequence several genomes including C. elegans , A. thaliana, H. sapiens, and O. sativa. In several recent whole-genome shotgun sequencing projects, physical maps have also been employed to validate and improve the quality of sequence assembl. This validation step has been used, for example, in the assembly of M. musculus, G. gallus , and O. anatinus .

The rapid market penetration of next-generation sequencing instruments (Roche/454, Illumina, and ABI SOLiD) is expected to bring physical mapping back to the center stage of genomics. Next-gen sequencing technologies produce massive amounts of short reads (about 200–300 bases for 454, 35 bases for Illumina and SOLiD) and therefore the de novo assembly of the whole eukaryotic genomes is extremely challenging . Arguably, the only realistic approach at this time is clone-by-clone sequencing, where each clone in the MTP is sequenced using next-gen technology, and the assembly is carried out separately clone by clone.

In addition to their prominence in sequencing projects, physical maps can also provide a robust infrastructure required by many applications in genomics such as marker assisted breeding , map-based cloning of a set of genes of interest . and EST mapping.

Physical maps can be built from data obtained by restriction digestion or hybridization experiments .In the former case, overlaps between clones are determined by a statistical method, then clones are arranged in an order that is consistent with the restriction fingerprint data. In the latter case, clone-probe associations (i.e., which clones hybridize to which probe) are used to find an arrangement of the probes such that clones can be ordered consistentlyIn practice, however, hybridization experiments rarely use single probes. Due to the time and cost involved, hybridizations between probes and clones are typically carried out for a pool of probes In this work, we assume that only clone-pool associations (hereafter called hybridization fingerprints) are available.

Nowadays almost all physical mapping projects that are based on restriction fingerprint data rely on a tool called FingerPrinted Contigs (FPC) . FPC implements an algorithm called consensus band (CB) that constructs a physical map using a combination of greedy and heuristic approaches. At the core of the CB algorithm, clones are assigned to contigs based on a coincidence score, called Sulston score, which measures the probability that two clones share a given number of restriction fragments (bands) according to a binomial probability distribution. Two bands are considered shared if their sizes are within a given tolerance value. Two clones are declared overlapping if their Sulston score is below a given cutoff threshold. For each contig, FPC builds a CB map, which is a coordinate system to which clones are aligned.

FPC does not attempt to resolve all the conflicts arising in the assembly of the physical map, but instead provides interactive features for manual editing. Although manual editing appears to be an unavoidable final step in any physical mapping project, this process is tedious, very time-consuming and requires a significant expertise. Obviously, the better the initial quality of the physical map produced by the algorithm, the smaller is amount of manual work involved.

Comparative sequence analysis

Comparative sequence analysis for the purpose of RNA structure determination usually begins with an alignment of related sequences. In our alignment procedure, closest relatives are aligned first on the basis of primary structure similarity; each group of aligned sequences is then treated collectively and aligned against other groups. Next, sets of conserved nucleotides are identified and used for aligning in the more variable regions. Finally, where little or no primary structure similarity exists, common secondary structural elements are used as additional markers.

In our derivation of secondary structure, we distinguish between base pairs that are supported by covariances and those that are not contradicted. (A covariance is the observation that a base pair in one organism is different by both bases when compared to the equivalent base pair in another organism.) If the two different pairs are of Watson-Crick type (G-C, A-U), we observe a compensating base change (CBC). Covariances and CBCs support the existence of a base pair because, during evolution, random single mutations that introduce an unstable pairing would not generally have been compensated for by another mutation that restored the stability unless it was required. Such an observation is positive evidence, and the more CBCs the stronger the evidence. Negative evidence is a mismatch which we define as neither a Watson-Crick pair nor G-U pair. Notably, sequence conservation provides neither positive nor negative evidence.

For each base pair we estimate positive and negative evidence by counting the number of CBCs and mismatches. The most conserved base pair at a given alignment position is identified. Then, the remaining pairs are added as CBCs where they covary. Our guideline is to consider base pairs supported if there is at least twice as much positive evidence as negative. As a general rule, when there is less, we prefer not to include a base pair. However, when a base pair is supported in a particular phylogenetic group and disproven in other groups, we include it as specific for that group.

Secondary Structures

Secondary structure models can be derived directly from the alignment. (RNA alignments might be used also to prove/disprove tertiary interations as well as interactions between more than one RNA molecule.) Typically, supported base pairs are juxtapositioned and connected with a symbol (line, circle, dot, etc.) to indicate the nature of the pairing. When there is negative evidence for a pair, the bases are spaced apart with no symbol between them.

Aligment Tools for Comparative sequence analysis

Pasting positions
Adds x positions to the left and/or y positions to the right of a CLUSTAL multiple sequence alignment.

Inter-block gap sizes
Calculates inter-block gap sizes for blocks in a CLUSTAL multiple alignment and checks for mismatches between aligned sequences and master sequences.

Consensus
Calculates the consensus for the CLUSTAL or MSF multiple alignment.

CLUSTALW
Builds a multiple alignment of sequences (or sequence segments, accession numbers as entries) using CLUSTALW

(For further information,Refer this URL for comparative sequence alignment)

ftp://ftp.sanger.ac.uk/pub4/opendoor/uruguay/2012/module4.pdf