UNIT V SECONDARY DATABASE



Secondary database


Introduction to Biological databases, organization and management of databases

The biological sciences encompass an enormous variety of information, from the environmental sciences, which give us a view of how species live and interact in a world filled with natural phenomena, to cell biology, which provides knowledge about the inner structure and function of the cell and beyond. All this information requires classification, organization and management. Biological data exhibits many special characteristics that make the management of biological information a particularly challenging problem. A multidisciplinary field called bioinformatics has emerged recently to address the management of genetic information, with special emphasis on DNA and protein sequence analysis. However, bioinformatics harnesses all other types of biological information as well, including the modeling, storage, retrieval, and management of that information. Moreover, applications of bioinformatics span new drug target validation and the development of novel drugs, the study of mutations and related diseases, anthropological investigations of the migration patterns of tribes, and therapeutic treatments.


Specific features of biological data



1. Biological data is highly complex when compared with data from most other domains or applications. Definitions of such data must therefore be able to represent a complex substructure of data as well as relationships, and to ensure that no information is lost during biological data modeling. Biological information systems must be able to represent any level of complexity in any data schema, relationship, or schema substructure. A good example of such a system is the MITOMAP database documenting the human mitochondrial genome (http://www.mitomap.org). The database includes data, and the relationships among those data, about ca. 17,000 nucleotide bases of mitochondrial DNA; 52 gene loci encoding mRNAs, rRNAs and tRNAs; over 1,500 known population variants; and over 60 disease associations. MITOMAP also includes links to over 3,000 literature references. Traditional RDBMSs or ODBMSs are unable to capture all aspects of these data.
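To make the idea of nested substructure concrete, here is a minimal Python sketch of a gene locus with variants and disease associations, loosely in the spirit of MITOMAP's content. The class and field names (GeneLocus, Variant, DiseaseAssociation) are hypothetical and chosen only for illustration; a real schema would be far richer.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DiseaseAssociation:
        disease: str            # e.g. "MELAS"
        status: str             # e.g. "reported" or "confirmed"

    @dataclass
    class Variant:
        position: int           # base position on the mitochondrial genome
        ref: str                # reference base
        alt: str                # variant base
        diseases: List[DiseaseAssociation] = field(default_factory=list)

    @dataclass
    class GeneLocus:
        name: str               # e.g. "MT-TL1"
        start: int
        end: int
        product: str            # "mRNA", "rRNA" or "tRNA"
        variants: List[Variant] = field(default_factory=list)

    # The well-known A3243G variant of the tRNA-Leu locus, associated
    # with MELAS: three levels of nesting in a single object.
    locus = GeneLocus("MT-TL1", 3230, 3304, "tRNA")
    locus.variants.append(
        Variant(3243, "A", "G", [DiseaseAssociation("MELAS", "confirmed")])
    )

A flat relational table cannot express this three-level nesting without splitting it across several tables and joins, which is exactly the modeling burden described above.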

2. The amount and range of variability in biological data is high. Therefore, systems handling biological data should be flexible in the data types and values they allow. Constraints on data types and values must be imposed with care, since unexpected values (e.g. outliers), which are not uncommon in biological data, could otherwise be excluded, resulting in the loss of information.

3. Schemas in biological databases change rapidly. This requires improved information flow between successive database releases, as well as support for schema evolution and data object migration. Most relational and object database systems do not support the ability to extend the schema. As a result, many biological/bioinformatics databases (such as GenBank, for example) release the entire database with a new schema once or twice a year rather than incrementally changing the system as each change is needed.

4. Representations of the same data by different biologists will likely be different (even when using the same system). Thus, mechanisms are needed that can align different biological schemas.

5. Most users of biological data need read-only access; write access to the database is rarely required. Usually the curators of a database are the only ones who need write privileges. The vast majority of users generate a wide variety of read-access patterns, but these patterns are not the same as those seen in traditional relational databases, and user-requested searches demand indexing of often unexpected combinations of data classes.

6. Most biologists lack knowledge of the internal structure of the database or of its schema design. Biological database interfaces should display information to users in a manner that is applicable to the problem they are trying to address and that reflects the underlying data structure in an easily understandable manner. Biologists usually know what data they require but have no technical knowledge of the data structure or of how a DBMS represents it. Relational database schemas fail to provide intuitive information to the user regarding the meaning of the schema. Web interfaces, on the other hand, often provide preset search interfaces, which may limit access into the database.

7. The context of data gives added meaning for its use in biological applications. Therefore, it is important that context is maintained and conveyed to the user when appropriate. It is also advantageous to integrate as many contexts as possible to maximize the interpretation of the biological data. For instance, a DNA sequence is not very useful without information describing its organization, function, etc.

8. Defining and representing complex queries is extremely important to the biologist. Hence, biological systems must support complex queries and provide tools for building such queries.
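As a minimal sketch of what such query-building support might look like, the following Python/sqlite3 example composes a cross-table query from high-level criteria so that the user never writes raw SQL. The table and column names (gene, variant, disease) are invented for illustration.

    import sqlite3

    def build_query(criteria):
        """Compose a join across gene, variant and disease tables."""
        sql = ("SELECT gene.name, variant.position, disease.name "
               "FROM gene "
               "JOIN variant ON variant.gene_id = gene.id "
               "JOIN disease ON disease.variant_id = variant.id")
        clauses, params = [], []
        for column, value in criteria.items():
            clauses.append(f"{column} = ?")
            params.append(value)
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql, params

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene    (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE variant (id INTEGER PRIMARY KEY, gene_id INTEGER,
                              position INTEGER);
        CREATE TABLE disease (id INTEGER PRIMARY KEY, variant_id INTEGER,
                              name TEXT);
        INSERT INTO gene    VALUES (1, 'MT-TL1');
        INSERT INTO variant VALUES (1, 1, 3243);
        INSERT INTO disease VALUES (1, 1, 'MELAS');
    """)
    sql, params = build_query({"gene.name": "MT-TL1"})
    print(conn.execute(sql, params).fetchall())   # [('MT-TL1', 3243, 'MELAS')]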

9. Users of biological information often require access to "old" values of the data, particularly when verifying previously reported results. Therefore, changes in values must be preserved through archives, enabling researchers to reconstruct previous work and reevaluate prior and current information.

All these specific characteristics of biological data point to the fact that traditional DBMSs do not fully satisfy the requirements imposed by complex biological data.

Existing Biological Databases


It has been estimated that over 1,000 major public and commercial biological databases were available to the scientific community by the end of 2006. These biological databases usually contain genomic and/or proteomic data; some databases are also used in taxonomy. As has already been mentioned, biological databases incorporate an enormous amount of various types of biological data, including (but certainly not limited to) nucleotide sequences of genes, amino acid sequences of proteins, information about protein function, structure and localization on the chromosome, the clinical effects of mutations, protein-ligand and gene-ligand interactions, as well as similarities between biological sequences, and so on. By far the most important resource on biological databases is the special yearly January issue of the journal Nucleic Acids Research, which categorizes all the publicly available online databases related to bioinformatics.

The most important biological databases can be roughly classified into the following groups:

- Primary sequence databases (include the International Nucleotide Sequence Database (INSD), consisting of DDBJ [DNA Data Bank of Japan], the EMBL Nucleotide DB [European Molecular Biology Laboratory] and GenBank [National Center for Biotechnology Information]). These databanks represent the current knowledge about the sequences of all organisms. They interchange the stored information and are the source for many other databases.

- Meta-databases (include MetaDB, containing links and descriptions for over 1,200 biological databases, Entrez [National Center for Biotechnology Information], euGenes [Indiana University], GeneCards [Weizmann Institute], SOURCE [Stanford University], Harvester [EMBL Heidelberg] and others). These meta-databases can be considered databases of databases, rather than one integration project or technology. They collect information from different sources and usually make it available in a new and more convenient form.

- Genome browsers (e.g. the Integrated Microbial Genomes system, the Ensembl Genome Browser [Sanger Institute and European Bioinformatics Institute] and many others). Genome browsers enable researchers to visualize and browse entire genomes of organisms, with annotated data including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. The annotated data usually comes from multiple diverse sources.

- Specialized databases (the Human Genome Organization database, SHMPD [the Singapore Human Mutation and Polymorphism Database] and many other databases).

- Pathway databases (e.g. BioCyc, Reactome and others)

- Protein sequence databases (UniProt [UniProt Consortium: EBI, Expasy, PIR], the Swiss-Prot Protein Knowledgebase [Swiss Institute of Bioinformatics] and many others)

-  Protein structure databases (Protein Data Bank, CATH Protein Structure
Classification, SCOP Structural Classification of Proteins etc.)

- Microarray databases (ArrayExpress [EBI], SMD [Stanford University], etc.)

-  Protein-Protein Interactions (BioGRID [Samuel Lunenfeld Research Institute],
STRING [EMBL])




Database bioinformatics tools


While database systems provide facilities to manage large data volumes, many database systems provide only partial support for the numeric computations required to perform statistical assessment of scientific data, and therefore require further development. This shortcoming limits the use of database systems by scientific users. The integration of numerical algebraic computations enables automatic optimization of entire computations, with the resulting benefits of query optimization, algorithm selection and data independence becoming available to computations on scientific databases. This removes the barrier between the database system and the computation, allowing the database optimizer to manipulate a larger portion of the application.
 
Algebraic Optimization of Computations

The pioneering work of Wolniewicz and Graefe extends the concept of a database query and shows, through a case study, how numeric computation over time-series data can be implemented effectively in scientific databases. This frees the user from concerns about the ordering of computation steps, algorithm selection, the use of indices and other physical properties such as data distribution. The authors developed a scientific optimizer using the "Volcano" optimizer generator, which can perform logical transformations and physical algorithm selection.

In the optimization of scientific computations, the identification of suitable transformation rules is of central importance. Once applicable transformation rules have been found and applied to generate equivalent logical expressions, the optimizer must find a set of physical algorithms that can implement or execute each expression. For instance, a join operator can be implemented as either a merge- or hash-based algorithm, while an interpolation can be implemented by any of a variety of curve-fitting algorithms. Other query optimizer issues include limiting the search space, detecting common subexpressions and improving cost estimation.
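The following toy Python sketch illustrates the idea of physical algorithm selection; it is not the authors' Volcano-based implementation, and the numbers in the cost functions are invented. One logical join has two candidate physical algorithms, and a simple cost model picks the cheaper one.

    import math

    def merge_join_cost(left, right):
        # Merge join reads both inputs once, but unsorted inputs must
        # first be sorted at O(n log n) cost.
        cost = left["rows"] + right["rows"]
        for side in (left, right):
            if not side["sorted"]:
                cost += side["rows"] * math.log2(max(side["rows"], 2))
        return cost

    def hash_join_cost(left, right):
        # Build a hash table on the smaller input, probe with the larger.
        return 1.5 * min(left["rows"], right["rows"]) + max(left["rows"], right["rows"])

    def choose_join_algorithm(left, right):
        candidates = {"merge-join": merge_join_cost(left, right),
                      "hash-join": hash_join_cost(left, right)}
        return min(candidates, key=candidates.get)

    sorted_inputs   = ({"rows": 10000, "sorted": True},  {"rows": 8000, "sorted": True})
    unsorted_inputs = ({"rows": 10000, "sorted": False}, {"rows": 8000, "sorted": False})
    print(choose_join_algorithm(*sorted_inputs))    # merge-join
    print(choose_join_algorithm(*unsorted_inputs))  # hash-join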
Time series can be viewed as sets where each record is tagged with a time value. The start and stop times, the average and maximum difference between samples, and whether or not the time differences between adjacent records are constant are all important for some operations. Spectra are treated in a manner very similar to time series, but with a frequency attribute attached to each record rather than a time value.
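A minimal Python sketch of the representation this paragraph describes: a time series as a set of time-tagged records plus the metadata an optimizer can exploit. The names (TimeSeries, is_evenly_sampled) are hypothetical.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class TimeSeries:
        records: List[Tuple[float, float]]   # (time, value) pairs

        @property
        def start(self) -> float:
            return min(t for t, _ in self.records)

        @property
        def stop(self) -> float:
            return max(t for t, _ in self.records)

        def is_evenly_sampled(self, tol: float = 1e-9) -> bool:
            # Constant spacing between adjacent records lets the optimizer
            # choose cheaper physical algorithms (e.g. a plain FFT).
            times = sorted(t for t, _ in self.records)
            diffs = [b - a for a, b in zip(times, times[1:])]
            return all(abs(d - diffs[0]) <= tol for d in diffs)

    ts = TimeSeries([(0.0, 1.2), (1.0, 0.7), (2.0, 0.9)])
    print(ts.start, ts.stop, ts.is_evenly_sampled())   # 0.0 2.0 True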

The operators supported within the system are divided into two groups, logical and physical. The user's original computation is composed of logical operators, while the final execution plan is expressed to the database in terms of physical operators. The authors demonstrate that, besides standard relational operators such as "select", "project" and "join", other necessary operators can be included, such as a random sampling operator, a digital filtering procedure, recomputation of the value of a single record based upon an averaging function applied to nearby records, "interpolation" and "extrapolation" operators, an operator for merging two time series, and a "spectral filter" operator.

The authors also include simple mathematical operations such as "correlation", "convolution" and "deconvolution" of spectra. Additionally, physical operators implemented as iterators are included, such as passing a window over a sorted data set, the fast Fourier transform and some others. It should be noted that some operators (e.g. the Fourier transform of a time series or spectrum) are "expensive" to perform, and therefore the decision on when to move between normal space and Fourier space is important for the optimization process. Logical transformations for scientific operators are also vital to the application of optimization. There are a number of transformations in specific scientific domains that are considered valid and thus need to be implemented.
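The Fourier-space decision can be made concrete with the convolution theorem: convolving two series directly costs O(n^2), while detouring through Fourier space costs O(n log n) for the transforms plus a pointwise product. The standalone numpy sketch below (not the authors' implementation) verifies that both routes give the same result; whether the detour pays off depends on the series length, which is exactly the trade-off the optimizer must weigh.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 256
    x = rng.standard_normal(n)    # a time series
    h = rng.standard_normal(n)    # a filter kernel

    # Direct circular convolution in the time domain: O(n^2).
    direct = np.array([sum(x[k] * h[(i - k) % n] for k in range(n))
                       for i in range(n)])

    # The same result via Fourier space: two FFTs, a pointwise product
    # and an inverse FFT, O(n log n) in total.
    via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

    assert np.allclose(direct, via_fft)   # same result, different cost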

Moreover, the user should be able to enable or disable certain transformations in order to control the accuracy of the results. The authors conclude that it is very beneficial to remove the barrier between the database system and scientific computations over the database. Thus, automatic optimization of integrated algebras is a crucial step in supporting scientific computations over database systems.


Sequence Retrieval System (SRS)


The large amount and diversity of data available to the scientific community in the many different databases is to the user's advantage, but it also creates the problem of knowing exactly where to find specific information. Another problem is that different databases with different contents also use different formats, adding further technical difficulties to the already complex task of accessing and exploiting the data.

Even though many databases have developed linking systems to relate their entries to data stored elsewhere, these links are difficult to use due to the differences between individual database implementations. With the expansion of databases, quick access to their contents has become an important issue. Different technical solutions are used or are under development at the major bioinformatics sites. A major contribution to this field was the development of the Sequence Retrieval System (SRS) at the EBI. SRS has addressed many of the difficulties in database access, the integration of databases, and analysis tools. SRS is an integration system for both data retrieval and data analysis applications. It provides, under a uniform and simple-to-use interface, a high degree of linking between the databases in the system. This allows the user to perform simple and complex queries across different databases. Its original way of linking makes it possible to query across databases even when they do not contain direct cross-references to each other.

The user is even able to ask questions like "Give me all the proteins that share InterPro domains with my protein" or "Give me all the known 3D structures of this set of proteins". The SRS server at EBI (http://srs.ebi.ac.uk) contains more than 140 biological databases (including sequence and sequence-related databases, bibliography, metabolic pathways, 3D structure and many other databases) and integrates many analysis tools. Results of such analyses are themselves indexed as SRS databases and can thus be linked to others by using predefined or user-defined SRS views.
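The following hypothetical Python sketch captures the idea behind this style of linking: even when two databases share no direct cross-references, a query can be routed through intermediate databases that reference both. The link graph below is invented for illustration; real SRS maintains indexed links between individual entries, not merely between whole databases.

    from collections import deque

    links = {                      # which databases cross-reference which
        "UNIPROT":  ["PDB", "INTERPRO", "EMBL"],
        "INTERPRO": ["UNIPROT"],
        "PDB":      ["UNIPROT"],
        "EMBL":     ["UNIPROT", "MEDLINE"],
        "MEDLINE":  ["EMBL"],
    }

    def link_path(source, target):
        """Breadth-first search for a chain of cross-references."""
        queue, seen = deque([[source]]), {source}
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                return path
            for nxt in links.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    # INTERPRO and MEDLINE have no direct link, yet a query can still be
    # routed between them:
    print(link_path("INTERPRO", "MEDLINE"))
    # ['INTERPRO', 'UNIPROT', 'EMBL', 'MEDLINE']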

Overall, the EBI and other major world bioinformatics centers provide a whole range of analysis tools for databases. There are currently more than 60 distinct services available at the EBI, such as a range of sequence homology and similarity algorithms (FASTA, BLAST and Smith-Waterman), sequence analysis tools (many European Molecular Biology Open Software Suite applications), and gene and structural prediction methods.





SWISSPROT

UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB). It is a high-quality, annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions. Since 2002, it has been maintained by the UniProt consortium and is accessible via the UniProt website.

The most important sources of information on protein sequences are the Swiss-Prot and TrEMBL protein sequence databases (http://www.expasy.ch/sprot/). The Swiss-Prot protein knowledgebase is an annotated protein sequence database, maintained collaboratively by the Swiss Institute of Bioinformatics and the EBI. It strives to provide sequences from all species, combined with a high level of manual annotation, a minimal level of redundancy and a high level of integration with other biomolecular databases. To make new protein sequences available to the public as quickly as possible without relaxing the high annotation standards of Swiss-Prot, the EBI provides a complement to Swiss-Prot known as TrEMBL. TrEMBL consists of computer-annotated entries derived from the translation of all coding sequences in the DDBJ/EMBL/GenBank Nucleotide Sequence Database.
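As a minimal sketch of programmatic access, the following standard-library Python snippet retrieves a Swiss-Prot entry in FASTA format. It assumes the current UniProt REST endpoint (https://rest.uniprot.org); the accession P12345 is just an example entry.

    from urllib.request import urlopen

    accession = "P12345"
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"

    with urlopen(url) as response:
        fasta = response.read().decode("utf-8")

    # A FASTA record: one header line, then the sequence wrapped over lines.
    header, *seq_lines = fasta.splitlines()
    sequence = "".join(seq_lines)
    print(header)                   # e.g. ">sp|P12345|AATM_RABIT ..."
    print(len(sequence), "residues")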

The TrEMBL section of UniProtKB was introduced in 1996 in response to the increased data flow resulting from genome projects. It was already recognized at that time that the traditional time- and labour-intensive manual annotation process which is the hallmark of Swiss-Prot could not be broadened to encompass all available protein sequences. Publicly available protein sequences obtained from the translation of annotated coding sequences in the EMBL-Bank/GenBank/DDBJ nucleotide sequence database are automatically processed and entered into UniProtKB/TrEMBL, where they are computer-annotated in order to make them swiftly available to the public.

UniProtKB/TrEMBL contains high-quality computationally analyzed records that are enriched with automatic annotation and classification. These UniProtKB/TrEMBL unreviewed entries are kept separate from the UniProtKB/Swiss-Prot manually reviewed entries so that the high-quality data of the latter is not diluted in any way.


A well-defined manual curation process is essential to ensure that all manually annotated entries are handled in a consistent manner. This process consists of 6 major mandatory steps: (1) sequence curation, (2) sequence analysis, (3) literature curation, (4) family-based curation, (5) evidence attribution, (6) quality assurance and integration of completed entries. Curation is performed by expert biologists using a range of tools that have been iteratively developed in close collaboration with curators.

(1) Sequence curation. Once a protein sequence has been selected for manual annotation on the basis of curation priorities, BLAST searches are run against UniProtKB to identify additional sequences from the same gene and to identify homologs. Sequences from the same gene and the same organism are merged into a single entry. Discrepancies between sequence reports are identified, and the underlying causes, such as alternative splicing, natural variation, frameshifts, incorrect initiation sites, incorrect exon boundaries and unidentified conflicts, are documented. Further errors can be found by comparing homologous sequences. These steps ensure that the sequence described for each protein in UniProtKB/Swiss-Prot is as complete and correct as possible, and they contribute to the accuracy and quality of subsequent curation steps.
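As a toy illustration of one small part of this step, the following self-contained Python function compares two sequence reports position by position and lists the conflicts a curator would need to resolve. Real curation relies on BLAST alignments rather than this naive positional comparison, and the sequences below are invented.

    def sequence_conflicts(report_a, report_b):
        """Report positions where two sequence reports disagree."""
        if len(report_a) != len(report_b):
            print(f"Length differs: {len(report_a)} vs {len(report_b)} "
                  "(possible alternative splicing or frameshift)")
        return [(i + 1, a, b)                       # 1-based positions
                for i, (a, b) in enumerate(zip(report_a, report_b))
                if a != b]

    seq1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    seq2 = "MKTAYIAKQRQISFVKSHFSRQPEERLGLIEVQ"
    print(sequence_conflicts(seq1, seq2))   # [(23, 'L', 'P')]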

(2) Sequence analysis. Sequences are analyzed using a range of selected sequence analysis tools. Computer predictions are manually reviewed and relevant results are selected for integration. Sequence annotation predictions include post-translational modifications, subcellular location, transmembrane domains and protein topology, domain identification and protein family classification.

(3) Literature curation. Journal articles provide the main source of experimental protein knowledge. Relevant publications are identified by searching literature databases, such as PubMed, and using literature mining tools. The full text of each paper is read and information is extracted and added to the entry. All experimental findings and authors' statements are compared both with the current knowledge on related proteins and the results from various protein sequence analysis tools. Annotation captured from the scientific literature includes protein and gene names, function, catalytic activity, cofactors, subcellular location, protein-protein interactions, patterns of expression, diseases associated with deficiencies in a protein, locations and roles of significant domains and sites, ion-, substrate- and cofactor-binding sites, catalytic residues, the variant protein forms produced by natural genetic variation, RNA editing, alternative splicing, proteolytic processing, and post-translational modification. Relevant Gene Ontology (GO) terms are assigned based on experimental data from the literature.

(4) Family-based curation. Reciprocal BLAST searches and phylogenetic resources are used to identify putative homologs, which are evaluated and curated. Annotation is standardized and propagated across homologous proteins to ensure data consistency.

(5) Evidence attribution. All information added to an entry during the manual annotation process is linked to the original source so that users can trace back the origin of each piece of information and evaluate it.

(6) Quality assurance, integration and update. Each completed entry undergoes quality assurance before integration into UniProtKB/Swiss-Prot and is updated as new data become available.

PIR

The Protein Information Resource (PIR) is an integrated public bioinformatics resource to support genomic, proteomic and systems biology research and scientific studies (Wu et al., 2003).

PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information. Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965 to 1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and of duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.

Dr. Winona Barker and Dr. Robert Ledley assumed leadership of the project after the untimely death of Dr. Dayhoff in 1983. In 1999, Dr. Cathy H. Wu joined the NBRF, and later the Georgetown University Medical Center (GUMC), to head the bioinformatics efforts of PIR; she has served first as Principal Investigator and, since 2001, as Director.

For over four decades, beginning with the Atlas of Protein Sequence and Structure, PIR has provided protein databases and analysis tools freely accessible to the scientific community, including the Protein Sequence Database (PSD).

In 2002, PIR, along with its international partners EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics), was awarded a grant from the NIH to create UniProt, a single worldwide database of protein sequence and function, by unifying the PIR-PSD, Swiss-Prot and TrEMBL databases.

In 2009 Dr. Wu accepted the Edward G. Jefferson Chair of Bioinformatics and Computational Biology at the University of Delaware (UD).

Today, PIR maintains staff at UD and GUMC and continues to offer world leading resources to assist with proteomic and genomic data integration and the propagation and standardization of protein annotation.

(Go through the link below and download the different PIR databases.)




KEGG

(Refer to UNIT VII for the KEGG material on this site.)