Introduction to Biochemical Databases, Organization and Management of Databases
The official definition provided by DAMA International, the professional organization for those in the data management profession, is: "Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise." This definition is fairly broad and encompasses a number of professions which may not have direct technical contact with lower-level aspects of data management, such as relational database management.
Alternatively, the definition provided in the DAMA Data Management Body of Knowledge (DAMA-DMBOK) is: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."
The concept of "Data Management" arose in the 1980s as technology moved from sequential processing (first cards, then tape) to random access processing. Since it was now technically possible to store a single fact in a single place and access it using random-access disk, those suggesting that "Data Management" was more important than "Process Management" used arguments such as "a customer's home address is stored in 75 (or some other large number) places in our computer systems." During this period, random access processing was not competitively fast, so those suggesting that "Process Management" was more important than "Data Management" used batch processing time as their primary argument. As applications moved more and more toward real-time, interactive use, it became obvious to most practitioners that both management processes were important. If the data was not well defined, it would be misused in applications. If the process was not well defined, it was impossible to meet user needs.
The biological sciences encompass an enormous variety of information, from the environmental sciences, which give us a view of how species live and interact in a world filled with natural phenomena, to cell biology, which provides knowledge about the inner structure and function of the cell, and beyond. All this information requires classification, organization and management. Biological data exhibits many special characteristics that make management of biological information a particularly challenging problem. A multidisciplinary field called bioinformatics has emerged recently to address the management of genetic information, with special emphasis on DNA and protein sequence analysis. However, bioinformatics also harnesses all other types of biological information and covers the modeling, storage, retrieval, and management of that information. Moreover, applications of bioinformatics span new drug target validation and development of novel drugs, the study of mutations and related diseases, anthropological investigations of the migration patterns of tribes, and therapeutic treatments.
Databases of intermediary metabolism, and indeed of biochemistry generally, offer computational challenges and opportunities to reorganize biological knowledge to facilitate exploration. Here I consider a simple case, that of the classification of enzymatic reactions, and show how the classification could be automated and extended using deductive technology.
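As a minimal sketch of what automating part of such a classification could look like, the snippet below splits an EC number into its hierarchy levels and looks up the top-level enzyme class. The seven class names follow the IUBMB Enzyme Commission scheme; the function name `classify_ec` and the returned field names are hypothetical, and this is a simple lookup, not the deductive approach mentioned above.

```python
# Top-level classes of the IUBMB Enzyme Commission (EC) scheme.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def classify_ec(ec_number):
    """Split an EC number such as '1.1.1.1' into its four hierarchy levels.

    Hypothetical helper for illustration only.
    """
    text = ec_number.strip()
    # Accept an optional leading "EC" prefix, e.g. "EC 3.4.21.4".
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated fields: %r" % ec_number)
    top_level = int(parts[0])
    if top_level not in EC_CLASSES:
        raise ValueError("unknown EC class: %r" % ec_number)
    return {
        "class": EC_CLASSES[top_level],
        "subclass": ".".join(parts[:2]),
        "sub_subclass": ".".join(parts[:3]),
        "serial": parts[3],
    }
```

Calling `classify_ec("EC 3.4.21.4")` returns a dictionary whose `class` entry is `"Hydrolases"` and whose `sub_subclass` entry is `"3.4.21"`.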
Technological advances in high-throughput techniques and efficient data acquisition methods have resulted in a massive amount of life science data. The data is stored in numerous databases that have been established over the last decades and are now essential resources for scientists. However, the diversity of the databases and of the underlying data models makes it difficult to combine this information for solving complex problems in systems biology. Currently, researchers typically have to browse several, often highly focused, databases to obtain the required information. Hence, there is a pressing need for more efficient systems for integrating, analyzing, and interpreting these data. Standardizing and virtually consolidating the databases, so that a variety of data sources can be reached through unified access, remains a major challenge.
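To make the integration problem concrete, the sketch below merges records from two sources that describe the same protein under different field names into a single view keyed by a shared identifier. The record layouts, the accession `ACC001`, and the helper `merge_sources` are all hypothetical; real databases differ far more than this and typically also require mapping between identifier schemes.

```python
# Hypothetical records from two sources with different data models.
sequence_records = [
    {"accession": "ACC001", "organism": "Homo sapiens", "length": 350},
    {"accession": "ACC002", "organism": "Mus musculus", "length": 410},
]
pathway_records = [
    {"protein_id": "ACC001", "pathway": "Glycolysis"},
    {"protein_id": "ACC001", "pathway": "Gluconeogenesis"},
    {"protein_id": "ACC999", "pathway": "Unknown"},  # no matching sequence
]


def merge_sources(sequences, pathways):
    """Build one unified record per accession from both sources."""
    unified = {}
    for rec in sequences:
        unified[rec["accession"]] = {
            "organism": rec["organism"],
            "length": rec["length"],
            "pathways": [],
        }
    for rec in pathways:
        entry = unified.get(rec["protein_id"])
        # Pathway rows without a matching sequence record are skipped here.
        if entry is not None:
            entry["pathways"].append(rec["pathway"])
    return unified


unified_view = merge_sources(sequence_records, pathway_records)
```

The design choice in this sketch is to treat the sequence source as authoritative and attach pathway annotations to it; a different integration policy (for example, keeping unmatched pathway rows) would be equally valid.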
Access to Multiple Databases
- DBGET (requires graphics)
- Entrez (DNA/RNA + Protein + Structures + Medline subset)
- LabonWeb
- NCBI multiple database access
- SRS (EMBL, SwissProt, PIR, PDB, Prosite...)
DNA & RNA Sequences
- DoubleTwist
- GenBank
- Genome Sequence Database (GSDB)
- EMBL Datalibrary
- EMBL Datalibrary @ EMBNET-Switzerland (last full release & updates)
- dbEST (cDNA fragments)
- Complete Organellar Genomes
- Vector db -- a sequence database of recombinant DNA vectors
DNA & RNA Motifs, Sites, etc
- REBASE, the restriction enzyme database
- TRANSFAC database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors.
Protein Sequences
- SwissProt
- PIR
- OWL non-redundant protein sequence database
- PIR-NRL 3D (Protein Sequence-Structure database)
Protein Motifs & Patterns
- BLOCKS (Protein Motifs)
- PRINTS Protein Motif Fingerprint Database
- PRODOM Protein Domain Server
- Prosite
- PUMA: Phylogenies, Metabolism, and Alignment
- SBASE (Protein Domains)
KEGG
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY database records networks of molecular interactions in the cells, and variants of them specific to particular organisms. As of July 2011, KEGG has switched to a subscription model and access via FTP is no longer free.
Introduction
KEGG, the Kyoto Encyclopedia of Genes and Genomes, was initiated by the Japanese human genome programme in 1995. Its developers consider KEGG to be a "computer representation" of the biological system. The KEGG database can be used for modeling and simulation, and for browsing and retrieving data. It is part of the systems biology approach.
KEGG maintains five main databases:
§ KEGG Atlas
§ KEGG Pathway
§ KEGG Genes
§ KEGG Ligand
§ KEGG BRITE
Databases
KEGG connects known information on molecular interaction networks, such as pathways and complexes (the Pathway database), information about genes and proteins generated by genome projects (including the Genes database), and information about biochemical compounds and reactions (including the compound and reaction databases). The latter two bodies of information form distinct networks, known as the protein network and the chemical universe, respectively. Efforts are in progress to extend KEGG further, including information on ortholog clusters in the KO (KEGG Orthology) database.
KEGG Pathways:
§ Genetic Information Processing
§ Environmental Information Processing
§ Cellular Processes
Ligand Database:
§ Compound
§ Drug
§ Glycan
§ Reaction
§ RPAIR (Reactant pair alignments)
§ Enzyme
KEGG – Table of Contents
The KEGG table of contents groups its entries into three categories (systems information, genomic information, and chemical information), with columns for Entry Point, Release Info, Search & Compute, and DBGET Search.
See Kanehisa et al. (2012) for the new features of KEGG.
KEGG for specific organisms
KEGG mapping for genome comparison and combination
KEGG as an integrated web resource
KEGG for computational analysis
KEGG for software development
KEGG web links - URLs for linking to the KEGG website
Desktop applications for utilizing KEGG
BRENDA
BRENDA (BRaunschweig ENzyme DAtabase) is an enzyme information system representing one of the most comprehensive enzyme repositories.
Introduction
BRENDA is an electronic information resource that comprises molecular and biochemical information on enzymes that have been classified by the IUBMB. Every classified enzyme is characterized with respect to the biochemical reaction it catalyzes. Kinetic properties of the corresponding reactants, i.e., substrates and products, are described in detail. BRENDA provides a web-based user interface that allows convenient and sophisticated access to the data. BRENDA was founded in 1987 at the former German National Research Centre for Biotechnology (now the Helmholtz Centre for Infection Research) in Braunschweig and was originally published as a series of books. From 1996 to 2007, BRENDA was located at the University of Cologne, where it developed into a publicly accessible enzyme information system. In 2007, BRENDA returned to Braunschweig. Currently, BRENDA is maintained and further developed at the Department of Bioinformatics and Biochemistry at the TU Braunschweig.
BRENDA contains enzyme-specific data manually extracted from primary scientific literature and additional data derived from automatic information retrieval methods such as text mining.
A major update of the data in BRENDA is performed twice a year. Besides the upgrade of its content, improvements of the user interface are also incorporated into the BRENDA database.
The latest update was performed in July 2010.
Content and Features
Database:
The database contains more than 40 data fields with enzyme-specific information on more than 4800 EC numbers that are classified according to the IUBMB. The different data fields cover information on the enzyme's nomenclature, reaction and specificity, enzyme structure, isolation and preparation, enzyme stability, kinetic parameters such as Km value and turnover number, occurrence and localization, mutants and engineered enzymes, application of enzymes, and ligand-related data. The data originates from almost 85,000 different scientific articles. Each enzyme entry is clearly linked to at least one literature reference, to its source organism, and, where available, to the protein sequence of the enzyme. Furthermore, cross-references to external information resources such as sequence and 3D-structure databases, as well as biomedical ontologies, are provided.
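The Km value and turnover number mentioned above are the parameters of the Michaelis-Menten rate law, v = kcat * [E] * [S] / (Km + [S]). The sketch below shows how such parameters can be turned into a reaction rate; the function name and the numeric values are hypothetical and are not taken from BRENDA.

```python
def michaelis_menten_rate(substrate, km, kcat, enzyme_total):
    """Initial reaction rate under the Michaelis-Menten rate law.

    substrate and km share one concentration unit (e.g. mM),
    kcat is the turnover number in 1/s, and enzyme_total is the total
    enzyme concentration, so the result is a concentration per second.
    """
    if substrate < 0 or km <= 0 or kcat < 0 or enzyme_total < 0:
        raise ValueError("parameters must be non-negative and km positive")
    vmax = kcat * enzyme_total  # maximal rate at saturating substrate
    return vmax * substrate / (km + substrate)


# Illustrative, made-up parameters: Km = 2 mM, kcat = 100 1/s, [E] = 0.01 mM.
rate_at_km = michaelis_menten_rate(substrate=2.0, km=2.0, kcat=100.0,
                                   enzyme_total=0.01)
```

When the substrate concentration equals Km, as in the usage line, the rate is half of the maximal rate, which is the defining property of Km.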
Extensions:
Since 2006, the data in BRENDA has been supplemented with information extracted from the scientific literature by a co-occurrence-based text mining approach. For this purpose, two text-mining repositories, FRENDA (Full Reference ENzyme DAta) and AMENDA (Automatic Mining of ENzyme DAta), were introduced. These text-mining results were derived from the titles and abstracts of all articles in the literature database PubMed.
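The general idea of co-occurrence-based text mining can be sketched as counting how often an enzyme name and an organism name appear in the same abstract. The snippet below is only an illustration under that assumption: the abstracts, term lists, and the function `count_cooccurrences` are made up, and it does not reproduce the pipeline actually used for FRENDA and AMENDA.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(texts, enzyme_terms, organism_terms):
    """Count (enzyme, organism) pairs that occur together in one text."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        # Naive case-insensitive substring matching; real systems would
        # use dictionaries of synonyms and proper tokenization.
        enzymes = [e for e in enzyme_terms if e.lower() in lowered]
        organisms = [o for o in organism_terms if o.lower() in lowered]
        counts.update(product(enzymes, organisms))
    return counts


# Made-up abstracts for illustration.
abstracts = [
    "Alcohol dehydrogenase from Saccharomyces cerevisiae was purified.",
    "Kinetics of alcohol dehydrogenase in Saccharomyces cerevisiae "
    "and Escherichia coli.",
    "Lysozyme activity was measured in Escherichia coli lysates.",
]
enzyme_terms = ["alcohol dehydrogenase", "lysozyme"]
organism_terms = ["Saccharomyces cerevisiae", "Escherichia coli"]

pair_counts = count_cooccurrences(abstracts, enzyme_terms, organism_terms)
```

Pairs with higher counts are stronger candidates for an enzyme-organism association, although co-occurrence alone does not establish that the relationship is real.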
Data access:
There are several tools to obtain access to the data in BRENDA. Some of them are listed here.
§ Several different query forms (e.g., quick and advanced search)
§ Chemical substructure search engine for ligand structures
Availability
BRENDA is free of charge to use. In addition, FRENDA and AMENDA are free for non-profit users. Commercial users need a license for these databases.
Other databases
BRENDA provides links to several other databases with a different focus on the enzyme, e.g., metabolic function or enzyme structure. Other links lead to ontological information on the corresponding gene of the enzyme in question. Links to the literature are established with PubMed. BRENDA also links to further databases and repositories such as:
§ ExPASy
§ KEGG
§ PROSITE
§ SCOP
§ CATH
§ InterPro
§ ChEBI
§ UniProt
New BRENDA release online since December 2011
ERGO