UNIT V SECONDARY DATABASE



Secondary database


Introduction to Biological databases, organization and management of databases

The biological sciences encompass an enormous variety of information, from the environmental sciences, which give us a view of how species live and interact in a world filled with natural phenomena, to cell biology, which provides knowledge about the inner structure and function of the cell and beyond. All this information requires classification, organization and management. Biological data exhibits many special characteristics that make the management of biological information a particularly challenging problem. A multidisciplinary field called bioinformatics has emerged recently to address the management of genetic information, with special emphasis on DNA and protein sequence analysis. However, bioinformatics harnesses all other types of biological information as well, including the modeling, storage, retrieval, and management of that information. Moreover, applications of bioinformatics span new drug target validation and the development of novel drugs, the study of mutations and related diseases, anthropological investigations of the migration patterns of tribes, and therapeutic treatments.


Specific features of biological data



1. Biological data is highly complex when compared with data from most other domains or applications. Definitions of such data must therefore be able to represent a complex substructure of data as well as relationships, and to ensure that no information is lost during biological data modeling. Biological information systems must be able to represent any level of complexity in any data schema, relationship, or schema substructure. A good example of such a system is the MITOMAP database documenting the human mitochondrial genome (http://www.mitomap.org). The database includes data, and the relationships among those data, about ca. 17,000 nucleotide bases of mitochondrial DNA; 52 gene loci encoding mRNAs, rRNAs and tRNAs; over 1,500 known population variants; and over 60 disease associations. MITOMAP also includes links to over 3,000 literature references. Traditional RDBMSs or ODBMSs are unable to capture all aspects of these data.
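To make the idea of nested substructure concrete, here is a minimal Python sketch of a gene locus with variants and disease associations, loosely in the spirit of MITOMAP's content. The class and field names (GeneLocus, Variant, DiseaseAssociation) are hypothetical and chosen only for illustration; a real schema would be far richer.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DiseaseAssociation:
        disease: str            # e.g. "MELAS"
        status: str             # e.g. "reported" or "confirmed"

    @dataclass
    class Variant:
        position: int           # base position on the mitochondrial genome
        ref: str                # reference base
        alt: str                # variant base
        diseases: List[DiseaseAssociation] = field(default_factory=list)

    @dataclass
    class GeneLocus:
        name: str               # e.g. "MT-TL1"
        start: int
        end: int
        product: str            # "mRNA", "rRNA" or "tRNA"
        variants: List[Variant] = field(default_factory=list)

    # The well-known A3243G variant of the tRNA-Leu locus, associated
    # with MELAS: three levels of nesting in a single object.
    locus = GeneLocus("MT-TL1", 3230, 3304, "tRNA")
    locus.variants.append(
        Variant(3243, "A", "G", [DiseaseAssociation("MELAS", "confirmed")])
    )

A flat relational table cannot express this three-level nesting without splitting it across several tables and joins, which is exactly the modeling burden described above.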

2. The amount and range of variability in biological data is high. Therefore, systems handling biological data should be flexible in the data types and values they allow. Constraints on data types and values must be imposed with care, since unexpected values (e.g. outliers), which are not uncommon in biological data, could otherwise be excluded, resulting in the loss of information.

3. Schemas in biological databases change rapidly. This requires improved information flow between successive database releases, as well as support for schema evolution and data object migration. Most relational and object database systems do not support the ability to extend the schema. As a result, many biological/bioinformatics databases (such as GenBank, for example) release the entire database with a new schema once or twice a year rather than incrementally changing the system as each change is needed.

4. Representations of the same data by different biologists will likely be different (even when using the same system). Thus, mechanisms are needed that can align different biological schemas.

5. Most users of biological data need read-only access; write access to the database is rarely required. Usually the curators of a database are the only ones who need write privileges. The vast majority of users generate a wide variety of read-access patterns, but these patterns are not the same as those seen in traditional relational databases, and user-requested searches demand indexing of often unexpected combinations of data classes.

6. Most biologists lack knowledge of the internal structure of the database or of its schema design. Biological database interfaces should display information to users in a manner that is applicable to the problem they are trying to address and that reflects the underlying data structure in an easily understandable manner. Biologists usually know what data they require but have no technical knowledge of the data structure or of how a DBMS represents it. Relational database schemas fail to provide intuitive information to the user regarding the meaning of the schema. Web interfaces, on the other hand, often provide preset search interfaces, which may limit access into the database.

7. The context of data gives added meaning for its use in biological applications. Therefore, it is important that context is maintained and conveyed to the user when appropriate. It is also advantageous to integrate as many contexts as possible to maximize the interpretation of the biological data. For instance, a DNA sequence is not very useful without information describing its organization, function, etc.

8. Defining and representing complex queries is extremely important to the biologist. Hence, biological systems must support complex queries and provide tools for building such queries.
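As a minimal sketch of what such query-building support might look like, the following Python/sqlite3 example composes a cross-table query from high-level criteria so that the user never writes raw SQL. The table and column names (gene, variant, disease) are invented for illustration.

    import sqlite3

    def build_query(criteria):
        """Compose a join across gene, variant and disease tables."""
        sql = ("SELECT gene.name, variant.position, disease.name "
               "FROM gene "
               "JOIN variant ON variant.gene_id = gene.id "
               "JOIN disease ON disease.variant_id = variant.id")
        clauses, params = [], []
        for column, value in criteria.items():
            clauses.append(f"{column} = ?")
            params.append(value)
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql, params

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene    (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE variant (id INTEGER PRIMARY KEY, gene_id INTEGER,
                              position INTEGER);
        CREATE TABLE disease (id INTEGER PRIMARY KEY, variant_id INTEGER,
                              name TEXT);
        INSERT INTO gene    VALUES (1, 'MT-TL1');
        INSERT INTO variant VALUES (1, 1, 3243);
        INSERT INTO disease VALUES (1, 1, 'MELAS');
    """)
    sql, params = build_query({"gene.name": "MT-TL1"})
    print(conn.execute(sql, params).fetchall())   # [('MT-TL1', 3243, 'MELAS')]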

9. Users of biological information often require access to "old" values of the data, particularly when verifying previously reported results. Therefore, changes in values must be preserved through archives, enabling researchers to reconstruct previous work and reevaluate prior and current information.

All these specific characteristics of biological data point to the fact that traditional DBMSs do not fully satisfy the requirements imposed by complex biological data.

Existing Biological Databases


It has been estimated that over 1,000 major public and commercial biological databases were available to the scientific community by the end of 2006. These biological databases usually contain genomic and/or proteomic data; some databases are also used in taxonomy. As has already been mentioned, biological databases incorporate an enormous amount of various types of biological data, including (but certainly not limited to) nucleotide sequences of genes, amino acid sequences of proteins, information about protein function, structure and localization on the chromosome, the clinical effects of mutations, protein-ligand and gene-ligand interactions, as well as similarities between biological sequences, and so on. By far the most important resource on biological databases is the special yearly January issue of the journal Nucleic Acids Research, which categorizes all the publicly available online databases related to bioinformatics.

The most important biological databases can be roughly classified into the following groups:

- Primary sequence databases (include the International Nucleotide Sequence Database (INSD), consisting of DDBJ [DNA Data Bank of Japan], the EMBL Nucleotide DB [European Molecular Biology Laboratory] and GenBank [National Center for Biotechnology Information]). These databanks represent the current knowledge about the sequences of all organisms. They interchange the stored information and are the source for many other databases.

- Meta-databases (include MetaDB, containing links and descriptions for over 1,200 biological databases, Entrez [National Center for Biotechnology Information], euGenes [Indiana University], GeneCards [Weizmann Institute], SOURCE [Stanford University], Harvester [EMBL Heidelberg] and others). These meta-databases can be considered databases of databases, rather than one integration project or technology. They collect information from different sources and usually make it available in a new and more convenient form.

- Genome browsers (e.g. the Integrated Microbial Genomes system, the Ensembl Genome Browser [Sanger Institute and European Bioinformatics Institute] and many others). Genome browsers enable researchers to visualize and browse entire genomes of organisms, with annotated data including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. The annotated data usually comes from multiple diverse sources.

- Specialized databases (the Human Genome Organization database, SHMPD [the Singapore Human Mutation and Polymorphism Database] and many other databases).

- Pathway databases (e.g. BioCyc, Reactome and others)

- Protein sequence databases (UniProt [UniProt Consortium: EBI, Expasy, PIR], the Swiss-Prot Protein Knowledgebase [Swiss Institute of Bioinformatics] and many others)

-  Protein structure databases (Protein Data Bank, CATH Protein Structure
Classification, SCOP Structural Classification of Proteins etc.)

- Microarray databases (ArrayExpress [EBI], SMD [Stanford University], etc.)

-  Protein-Protein Interactions (BioGRID [Samuel Lunenfeld Research Institute],
STRING [EMBL])




Database bioinformatics tools


While database systems provide facilities to manage large data volumes, many database systems provide only partial support for the numeric computations required to perform statistical assessment of scientific data, and therefore require further development. This shortcoming limits the use of database systems by scientific users. The integration of numerical algebraic computations enables automatic optimization of entire computations, with the resulting benefits of query optimization, algorithm selection and data independence becoming available to computations on scientific databases. This removes the barrier between the database system and the computation, allowing the database optimizer to manipulate a larger portion of the application.
 
Algebraic Optimization of Computations

The pioneering work of Wolniewicz and Graefe extends the concept of a database query and shows, through a case study, how numeric computation over time-series data can be implemented effectively in scientific databases. This frees the user from concerns about the ordering of computation steps, algorithm selection, the use of indices and other physical properties such as data distribution. The authors developed a scientific optimizer using the "Volcano" optimizer generator, which can perform logical transformations and physical algorithm selection.

In the optimization of scientific computations, the identification of suitable transformation rules is of central importance. Once applicable transformation rules have been found and applied to generate equivalent logical expressions, the optimizer must find a set of physical algorithms that can implement or execute each expression. For instance, a join operator can be implemented as either a merge- or hash-based algorithm, while an interpolation can be implemented by any of a variety of curve-fitting algorithms. Other query optimizer issues include limiting the search space, detecting common subexpressions and improving cost estimation.
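The following toy Python sketch illustrates the idea of physical algorithm selection; it is not the authors' Volcano-based implementation, and the numbers in the cost functions are invented. One logical join has two candidate physical algorithms, and a simple cost model picks the cheaper one.

    import math

    def merge_join_cost(left, right):
        # Merge join reads both inputs once, but unsorted inputs must
        # first be sorted at O(n log n) cost.
        cost = left["rows"] + right["rows"]
        for side in (left, right):
            if not side["sorted"]:
                cost += side["rows"] * math.log2(max(side["rows"], 2))
        return cost

    def hash_join_cost(left, right):
        # Build a hash table on the smaller input, probe with the larger.
        return 1.5 * min(left["rows"], right["rows"]) + max(left["rows"], right["rows"])

    def choose_join_algorithm(left, right):
        candidates = {"merge-join": merge_join_cost(left, right),
                      "hash-join": hash_join_cost(left, right)}
        return min(candidates, key=candidates.get)

    sorted_inputs   = ({"rows": 10000, "sorted": True},  {"rows": 8000, "sorted": True})
    unsorted_inputs = ({"rows": 10000, "sorted": False}, {"rows": 8000, "sorted": False})
    print(choose_join_algorithm(*sorted_inputs))    # merge-join
    print(choose_join_algorithm(*unsorted_inputs))  # hash-join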
Time series can be viewed as sets where each record is tagged with a time value. The start and stop times, the average and maximum difference between samples, and whether or not the time differences between adjacent records are constant are all important for some operations. Spectra are treated in a manner very similar to time series, but with a frequency attribute attached to each record rather than a time value.
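A minimal Python sketch of the representation this paragraph describes: a time series as a set of time-tagged records plus the metadata an optimizer can exploit. The names (TimeSeries, is_evenly_sampled) are hypothetical.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class TimeSeries:
        records: List[Tuple[float, float]]   # (time, value) pairs

        @property
        def start(self) -> float:
            return min(t for t, _ in self.records)

        @property
        def stop(self) -> float:
            return max(t for t, _ in self.records)

        def is_evenly_sampled(self, tol: float = 1e-9) -> bool:
            # Constant spacing between adjacent records lets the optimizer
            # choose cheaper physical algorithms (e.g. a plain FFT).
            times = sorted(t for t, _ in self.records)
            diffs = [b - a for a, b in zip(times, times[1:])]
            return all(abs(d - diffs[0]) <= tol for d in diffs)

    ts = TimeSeries([(0.0, 1.2), (1.0, 0.7), (2.0, 0.9)])
    print(ts.start, ts.stop, ts.is_evenly_sampled())   # 0.0 2.0 True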

The operators supported within the system are divided into two groups, logical and physical. The user's original computation is composed of logical operators, while the final execution plan is expressed to the database in terms of physical operators. The authors demonstrate that, besides standard relational operators such as "select", "project" and "join", other necessary operators can be included, such as a random sampling operator, a digital filtering procedure, recomputation of the value of a single record based upon an averaging function applied to nearby records, "interpolation" and "extrapolation" operators, an operator for merging two time series, and a "spectral filter" operator.

The authors also include simple mathematical operations such as "correlation", "convolution" and "deconvolution" of spectra. Additionally, physical operators implemented as iterators are included, such as passing a window over a sorted data set, the fast Fourier transform and some others. It should be noted that some operators (e.g. the Fourier transform of a time series or spectrum) are "expensive" to perform, and therefore the decision on when to move between normal space and Fourier space is important for the optimization process. Logical transformations for scientific operators are also vital to the application of optimization. There are a number of transformations in specific scientific domains that are considered valid and thus need to be implemented.
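The Fourier-space decision can be made concrete with the convolution theorem: convolving two series directly costs O(n^2), while detouring through Fourier space costs O(n log n) for the transforms plus a pointwise product. The standalone numpy sketch below (not the authors' implementation) verifies that both routes give the same result; whether the detour pays off depends on the series length, which is exactly the trade-off the optimizer must weigh.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 256
    x = rng.standard_normal(n)    # a time series
    h = rng.standard_normal(n)    # a filter kernel

    # Direct circular convolution in the time domain: O(n^2).
    direct = np.array([sum(x[k] * h[(i - k) % n] for k in range(n))
                       for i in range(n)])

    # The same result via Fourier space: two FFTs, a pointwise product
    # and an inverse FFT, O(n log n) in total.
    via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

    assert np.allclose(direct, via_fft)   # same result, different cost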

Moreover, the user should be able to enable or disable certain transformations in order to control the accuracy of the results. The authors conclude that it is very beneficial to remove the barrier between the database system and scientific computations over the database. Thus, automatic optimization of integrated algebras is a crucial step in supporting scientific computations over database systems.


Sequence Retrieval System (SRS)


The large amount and diversity of data available to the scientific community in the many different databases is to the user's advantage, but it also creates the problem of knowing exactly where to find specific information. Another problem is that different databases with different contents also use different formats, adding further technical difficulties to the already complex task of accessing and exploiting the data.

Even though many databases have developed linking systems to relate their entries to data stored elsewhere, these links are difficult to use due to the differences between individual database implementations. With the expansion of databases, quick access to their contents has become an important issue. Different technical solutions are used or are under development at the major bioinformatics sites. A major contribution to this field was the development of the Sequence Retrieval System (SRS) at the EBI. SRS has addressed many of the difficulties in database access, the integration of databases, and analysis tools. SRS is an integration system for both data retrieval and data analysis applications. It provides, under a uniform and simple-to-use interface, a high degree of linking between the databases in the system. This allows the user to perform simple and complex queries across different databases. Its original way of linking makes it possible to query across databases even when they do not contain direct cross-references to each other.

The user is even able to ask questions like "Give me all the proteins that share InterPro domains with my protein" or "Give me all the known 3D structures of this set of proteins". The SRS server at EBI (http://srs.ebi.ac.uk) contains more than 140 biological databases (including sequence and sequence-related databases, bibliography, metabolic pathways, 3D structure and many other databases) and integrates many analysis tools. Results of such analyses are themselves indexed as SRS databases and can thus be linked to others by using predefined or user-defined SRS views.
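The following hypothetical Python sketch captures the idea behind this style of linking: even when two databases share no direct cross-references, a query can be routed through intermediate databases that reference both. The link graph below is invented for illustration; real SRS maintains indexed links between individual entries, not merely between whole databases.

    from collections import deque

    links = {                      # which databases cross-reference which
        "UNIPROT":  ["PDB", "INTERPRO", "EMBL"],
        "INTERPRO": ["UNIPROT"],
        "PDB":      ["UNIPROT"],
        "EMBL":     ["UNIPROT", "MEDLINE"],
        "MEDLINE":  ["EMBL"],
    }

    def link_path(source, target):
        """Breadth-first search for a chain of cross-references."""
        queue, seen = deque([[source]]), {source}
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                return path
            for nxt in links.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    # INTERPRO and MEDLINE have no direct link, yet a query can still be
    # routed between them:
    print(link_path("INTERPRO", "MEDLINE"))
    # ['INTERPRO', 'UNIPROT', 'EMBL', 'MEDLINE']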

Overall, the EBI and other major world bioinformatics centers provide a whole range of analysis tools for databases. There are currently more than 60 distinct services available at the EBI, such as a range of sequence homology and similarity algorithms (FASTA, BLAST and Smith-Waterman), sequence analysis tools (many European Molecular Biology Open Software Suite applications), and gene and structural prediction methods.





SWISSPROT

UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB). It is a high-quality, annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions. Since 2002, it has been maintained by the UniProt consortium and is accessible via the UniProt website.

The most important sources of information on protein sequences are the Swiss-Prot and TrEMBL protein sequence databases (http://www.expasy.ch/sprot/). The Swiss-Prot protein knowledgebase is an annotated protein sequence database, maintained collaboratively by the Swiss Institute of Bioinformatics and the EBI. It strives to provide sequences from all species, combined with a high level of manual annotation, a minimal level of redundancy and a high level of integration with other biomolecular databases. To make new protein sequences available to the public as quickly as possible without relaxing the high annotation standards of Swiss-Prot, the EBI provides a complement to Swiss-Prot known as TrEMBL. TrEMBL consists of computer-annotated entries derived from the translation of all coding sequences in the DDBJ/EMBL/GenBank Nucleotide Sequence Database.
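As a minimal sketch of programmatic access, the following standard-library Python snippet retrieves a Swiss-Prot entry in FASTA format. It assumes the current UniProt REST endpoint (https://rest.uniprot.org); the accession P12345 is just an example entry.

    from urllib.request import urlopen

    accession = "P12345"
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"

    with urlopen(url) as response:
        fasta = response.read().decode("utf-8")

    # A FASTA record: one header line, then the sequence wrapped over lines.
    header, *seq_lines = fasta.splitlines()
    sequence = "".join(seq_lines)
    print(header)                   # e.g. ">sp|P12345|AATM_RABIT ..."
    print(len(sequence), "residues")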

The TrEMBL section of UniProtKB was introduced in 1996 in response to the increased data flow resulting from genome projects. It was already recognized at that time that the traditional time- and labour-intensive manual annotation process which is the hallmark of Swiss-Prot could not be broadened to encompass all available protein sequences. Publicly available protein sequences obtained from the translation of annotated coding sequences in the EMBL-Bank/GenBank/DDBJ nucleotide sequence database are automatically processed and entered into UniProtKB/TrEMBL, where they are computer-annotated in order to make them swiftly available to the public.

UniProtKB/TrEMBL contains high-quality computationally analyzed records that are enriched with automatic annotation and classification. These UniProtKB/TrEMBL unreviewed entries are kept separate from the UniProtKB/Swiss-Prot manually reviewed entries so that the high-quality data of the latter is not diluted in any way.


A well-defined manual curation process is essential to ensure that all manually annotated entries are handled in a consistent manner. This process consists of 6 major mandatory steps: (1) sequence curation, (2) sequence analysis, (3) literature curation, (4) family-based curation, (5) evidence attribution, (6) quality assurance and integration of completed entries. Curation is performed by expert biologists using a range of tools that have been iteratively developed in close collaboration with curators.

(1) Sequence curation. Once a protein sequence has been selected for manual annotation on the basis of curation priorities, BLAST searches are run against UniProtKB to identify additional sequences from the same gene and to identify homologs. Sequences from the same gene and the same organism are merged into a single entry. Discrepancies between sequence reports are identified, and the underlying causes, such as alternative splicing, natural variation, frameshifts, incorrect initiation sites, incorrect exon boundaries and unidentified conflicts, are documented. Further errors can be found by comparing homologous sequences. These steps ensure that the sequence described for each protein in UniProtKB/Swiss-Prot is as complete and correct as possible, and they contribute to the accuracy and quality of subsequent curation steps.
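As a toy illustration of one small part of this step, the following self-contained Python function compares two sequence reports position by position and lists the conflicts a curator would need to resolve. Real curation relies on BLAST alignments rather than this naive positional comparison, and the sequences below are invented.

    def sequence_conflicts(report_a, report_b):
        """Report positions where two sequence reports disagree."""
        if len(report_a) != len(report_b):
            print(f"Length differs: {len(report_a)} vs {len(report_b)} "
                  "(possible alternative splicing or frameshift)")
        return [(i + 1, a, b)                       # 1-based positions
                for i, (a, b) in enumerate(zip(report_a, report_b))
                if a != b]

    seq1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    seq2 = "MKTAYIAKQRQISFVKSHFSRQPEERLGLIEVQ"
    print(sequence_conflicts(seq1, seq2))   # [(23, 'L', 'P')]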

(2) Sequence analysis. Sequences are analyzed using a range of selected sequence analysis tools. Computer predictions are manually reviewed and relevant results are selected for integration. Sequence annotation predictions include post-translational modifications, subcellular location, transmembrane domains and protein topology, domain identification and protein family classification.

(3) Literature curation. Journal articles provide the main source of experimental protein knowledge. Relevant publications are identified by searching literature databases, such as PubMed, and using literature mining tools. The full text of each paper is read and information is extracted and added to the entry. All experimental findings and authors' statements are compared both with the current knowledge on related proteins and the results from various protein sequence analysis tools. Annotation captured from the scientific literature includes protein and gene names, function, catalytic activity, cofactors, subcellular location, protein-protein interactions, patterns of expression, diseases associated with deficiencies in a protein, locations and roles of significant domains and sites, ion-, substrate- and cofactor-binding sites, catalytic residues, the variant protein forms produced by natural genetic variation, RNA editing, alternative splicing, proteolytic processing, and post-translational modification. Relevant Gene Ontology (GO) terms are assigned based on experimental data from the literature.

(4) Family-based curation. Reciprocal BLAST searches and phylogenetic resources are used to identify putative homologs, which are evaluated and curated. Annotation is standardized and propagated across homologous proteins to ensure data consistency.

(5) Evidence attribution. All information added to an entry during the manual annotation process is linked to the original source so that users can trace back the origin of each piece of information and evaluate it.

(6) Quality assurance, integration and update. Each completed entry undergoes quality assurance before integration into UniProtKB/Swiss-Prot and is updated as new data become available.

PIR

The Protein Information Resource (PIR) is an integrated public bioinformatics resource to support genomic, proteomic and systems biology research and scientific studies (Wu et al., 2003).

PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information. Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965 to 1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and of duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.

Dr. Winona Barker and Dr. Robert Ledley assumed leadership of the project after the untimely death of Dr. Dayhoff in 1983. In 1999, Dr. Cathy H. Wu joined the NBRF, and later the Georgetown University Medical Center (GUMC), to head the bioinformatics efforts of PIR; she has served first as Principal Investigator and, since 2001, as Director.

For over four decades, beginning with the Atlas of Protein Sequence and Structure, PIR has provided protein databases and analysis tools freely accessible to the scientific community, including the Protein Sequence Database (PSD).

In 2002, PIR, along with its international partners EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics), was awarded a grant from the NIH to create UniProt, a single worldwide database of protein sequence and function, by unifying the PIR-PSD, Swiss-Prot and TrEMBL databases.

In 2009 Dr. Wu accepted the Edward G. Jefferson Chair of Bioinformatics and Computational Biology at the University of Delaware (UD).

Today, PIR maintains staff at UD and GUMC and continues to offer world leading resources to assist with proteomic and genomic data integration and the propagation and standardization of protein annotation.

(Go through the link below and download the different PIR databases.)




KEGG

(Refer to UNIT VII for the KEGG material on this site.)