A Text Book On BIOINFORMATICS -BY ZAHOORULLAH S MD: UNIT VII BIOCHEMICAL DATABASE

Introduction to Biochemical databases, organization and management of databases

The official definition provided by DAMA International, the professional organization for those in the data management profession, is: "Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise." This definition is fairly broad and encompasses a number of professions which may not have direct technical contact with lower-level aspects of data management, such as relational database management.

Alternatively, the definition provided in the DAMA Data Management Body of Knowledge (DAMA-DMBOK) is: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."

The concept of "Data Management" arose in the 1980s as technology moved from sequential processing (first cards, then tape) to random access processing. Since it was now technically possible to store a single fact in a single place and access that using random access disk, those suggesting that "Data Management" was more important than "Process Management" used arguments such as "a customer's home address is stored in 75 (or some other large number) places in our computer systems." During this period, random access processing was not competitively fast, so those suggesting "Process Management" was more important than "Data Management" used batch processing time as their primary argument. As applications moved more and more into real-time, interactive applications, it became obvious to most practitioners that both management processes were important. If the data was not well defined, the data would be mis-used in applications. If the process wasn't well defined, it was impossible to meet user needs.

The biological sciences encompass an enormous variety of information, from the environmental sciences, which give us a view of how species live and interact in a world filled with natural phenomena, to cell biology, which provides knowledge about the inner structure and function of the cell and beyond. All this information requires classification, organization and management. Biological data exhibits many special characteristics that make management of biological information a particularly challenging problem. A multidisciplinary field called bioinformatics has emerged recently to address information management of genetic information with special emphasis on DNA and protein sequence analysis. However, bioinformatics also harnesses all other types of biological information and the modeling, storage, retrieval, and management of that information. Moreover, applications of bioinformatics span new drug target validation and development of novel drugs, study of mutations and related diseases, anthropological investigations on migration patterns of tribes, and therapeutic treatments.

Databases of intermediary metabolism, and indeed of biochemistry generally, offer computational challenges and opportunities to reorganize biological knowledge to facilitate exploration. Here I consider a simple case, that of the classification of enzymatic reactions, and show how the classification could be automated and extended using deductive technology.

Technological advances in high-throughput techniques and efficient data acquisition methods have resulted in a massive amount of life science data. The data is stored in numerous databases that have been established over the last decades and are essential resources for scientists nowadays. However, the diversity of the databases and the underlying data models make it difficult to combine this information for solving complex problems in systems biology. Currently, researchers typically have to browse several, often highly focused, databases to obtain the required information. Hence, there is a pressing need for more efficient systems for integrating, analyzing, and interpreting these data. The standardization and virtual consolidation of the databases is a major challenge resulting in a unified access to a variety of data sources.

Access to Multiple Databases

DBGET (requires graphics)
Entrez (DNA/RNA + Protein + Structures + Medline subset)
LabonWeb
NCBI multiple database access
SRS (EMBL, SwissProt, PIR, PDB, Prosite...)

DNA & RNA Sequences

DoubleTwist
Genbank

Entrez (DNA/RNA + Protein + Medline subset + Structures)
BankIt -- sequence submission page

Genome Sequence Database (GSDB)
EMBL Datalibrary

@ EMBNET-Switzerland Last full release & updates

DbEST (cDNA fragments)
Complete Organellar Genomes
Vector db -- a sequence database of recombinant DNA vectors

DNA & RNA Motifs, Sites, etc

REBASE, the restriction enzyme database
TRANSFAC database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors.

Protein Sequences

KEGG

Introduction

KEGG, the Kyoto Encyclopedia of Genes and Genomes, was initiated under the Japanese human genome programme in 1995. Its developers consider KEGG to be a "computer representation" of the biological system. The KEGG database can be used for modeling and simulation, and for browsing and retrieval of data. It is part of the systems biology approach.

KEGG maintains five main databases:

§ KEGG Atlas

§ KEGG Pathway

§ KEGG Genes

§ KEGG Ligand

§ KEGG BRITE

Databases

KEGG connects known information on molecular interaction networks, such as pathways and complexes (the Pathway database), information about genes and proteins generated by genome projects (including the Genes database), and information about biochemical compounds and reactions (including the Compound and Reaction databases). These databases form different networks, known as the "protein network" and the "chemical universe" respectively. Efforts are in progress to extend KEGG further, including information on ortholog clusters in the KO (KEGG Orthology) database.

KEGG Pathways:

§ Metabolism

§ Genetic Information Processing

§ Environmental Information Processing

§ Cellular Processes

§ Human Diseases

§ Drug development

Ligand Database:

§ Compound

§ Drug

§ Glycan

§ Reaction

§ RPAIR (Reactant pair alignments)

§ Enzyme

KEGG – TABLE OF CONTENTS


	Category	Entry Point	Release Info	Search & Compute	DBGET Search

	Systems information	KEGG PATHWAY KEGG BRITE KEGG MODULE KEGG Mapper KEGG Atlas	New maps Update history New hierarchies Update history	Search Pathway Search Brite Search Module KEGG pathway maps BRITE functional hierarchies KEGG modules	PATHWAY BRITE MODULE

		KEGG DISEASE KEGG DRUG KEGG ENVIRON KEGG MEDICUS	New drug maps Update history	Human diseases Infectious diseases ATC drug classification	DISEASE DRUG ENVIRON

	Genomic information	KEGG ORTHOLOGY		KEGG Orthology (KO)	ORTHOLOGY

		KEGG GENES KEGG GENOME KEGG Organisms	New organisms Update history	SSDB search BLAST / FASTA search KAAS automatic annotation Map organisms to taxonomy Generate taxonomy tree KEGG organisms	GENES DGENES EGENES MGENES GENOME EGENOME MGENOME

	Chemical information	KEGG LIGAND KEGG COMPOUND KEGG GLYCAN KEGG REACTION		SIMCOMP / SUBCOMP search KCaM search E-zyme reaction prediction PathPred pathway prediction PathComp path computation PathSearch reaction search	COMPOUND GLYCAN REACTION RPAIR RCLASS ENZYME

See Kanehisa et al. (2012) for the new features of KEGG.

KEGG for specific organisms

KEGG mapping for genome comparison and combination

KEGG as an integrated web resource

DBGET - for keyword search of KEGG and other databases

LinkDB - for searching outside databases linked from KEGG

KEGG for computational analysis

BLAST / FASTA - Sequence similarity search

SIMCOMP / SUBCOMP - Chemical structure similarity search

KCaM - Glycan structure similarity search

E-zyme - Enzymatic reaction prediction

PathPred / PathComp / PathSearch - Metabolic pathway prediction/computation/search

KAAS - Genome/EST annotation

KEGG EXPRESSION - Gene expression data analysis

KEGG for software development

KEGG XML - XML representation of KEGG pathways

KEGG API - SOAP/WSDL interface for the KEGG system

KEGG web links - URLs for linking to the KEGG website

Desktop applications for utilizing KEGG

KegHier - Java application for browsing KEGG BRITE

KegArray - Java application for microarray data analysis

KegDraw - Java application for drawing compound/glycan structures

BRENDA

BRENDA (BRaunschweig ENzyme DAtabase) is an enzyme information system representing one of the most comprehensive enzyme repositories.

Introduction

BRENDA is an electronic information resource that comprises molecular and biochemical information on enzymes that have been classified by the IUBMB. Every classified enzyme is characterized with respect to the biochemical reaction it catalyzes. Kinetic properties of the corresponding reactants, i.e., substrates and products, are described in detail. BRENDA provides a web-based user interface that allows convenient and sophisticated access to the data. BRENDA was founded in 1987 at the former German National Research Centre for Biotechnology (now: Helmholtz Centre for Infection Research) in Braunschweig and was originally published as a series of books. From 1996 to 2007, BRENDA was located at the University of Cologne. There, BRENDA developed into a publicly accessible enzyme information system. In 2007, BRENDA returned to Braunschweig. Currently, BRENDA is maintained and further developed at the Department of Bioinformatics and Biochemistry at the TU Braunschweig.

BRENDA contains enzyme-specific data manually extracted from primary scientific literature and additional data derived from automatic information retrieval methods such as text mining.

A major update of the data in BRENDA is performed twice a year. Besides the upgrade of its content, improvements of the user interface are also incorporated into the BRENDA database.

The latest update was performed in July 2010.

Content and Features

Database:

The database contains more than 40 data fields with enzyme-specific information on more than 4800 EC numbers that are classified according to the IUBMB. The different data fields cover information on the enzyme's nomenclature, reaction and specificity, enzyme structure, isolation and preparation, enzyme stability, kinetic parameters such as Km value and turnover number, occurrence and localization, mutants and engineered enzymes, application of enzymes and ligand-related data. The data originates from almost 85,000 different scientific articles. Each enzyme entry is clearly linked to at least one literature reference, to its source organism, and, where available, to the protein sequence of the enzyme. Furthermore, cross-references to external information resources such as sequence and 3D-structure databases, as well as biomedical ontologies, are provided.
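Because BRENDA is organized around EC numbers, it can help to see how such a number breaks down into its four hierarchical levels. The following is a minimal sketch, not BRENDA's own code: the class `ECNumber` and the helper `group_by_class` are hypothetical names introduced for illustration, and only the top-level class names come from the IUBMB nomenclature.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Top-level enzyme classes of the IUBMB nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


@dataclass(frozen=True)
class ECNumber:
    """Hypothetical helper: one EC number as four levels ('-' becomes None)."""
    levels: Tuple[Optional[int], ...]

    @classmethod
    def parse(cls, text: str) -> "ECNumber":
        body = text.strip()
        if body.upper().startswith("EC"):
            body = body[2:].strip()
        parts = body.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated levels, got {text!r}")
        levels = tuple(None if p == "-" else int(p) for p in parts)
        if levels[0] not in EC_CLASSES:
            raise ValueError(f"unknown top-level class in {text!r}")
        return cls(levels)

    def __str__(self) -> str:
        return ".".join("-" if x is None else str(x) for x in self.levels)

    def class_name(self) -> str:
        return EC_CLASSES[self.levels[0]]

    def lineage(self):
        """Path from the top-level class down to this entry, as in a tree browser."""
        path = []
        for depth in range(1, 5):
            if self.levels[depth - 1] is None:
                break
            shown = [str(x) for x in self.levels[:depth]] + ["-"] * (4 - depth)
            path.append(".".join(shown))
        return path


def group_by_class(ec_numbers):
    """Group EC number strings under their top-level class name."""
    groups = {}
    for text in ec_numbers:
        ec = ECNumber.parse(text)
        groups.setdefault(ec.class_name(), []).append(str(ec))
    return {name: sorted(members) for name, members in groups.items()}
```

Calling `ECNumber.parse("EC 1.1.1.1").lineage()` walks from the class level down to the full number, which is the same path a user follows in an EC tree browser.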

Extensions:

Since 2006, the data in BRENDA has been supplemented with information extracted from the scientific literature by a co-occurrence based text mining approach. For this purpose, two text-mining repositories, FRENDA (Full Reference ENzyme DAta) and AMENDA (Automatic Mining of ENzyme DAta), were introduced. These text-mining results were derived from the titles and abstracts of all articles in the literature database PubMed.

Data access:

There are several tools to obtain access to the data in BRENDA. Some of them are listed here.

§ Several different query forms (e.g., quick and advanced search)

§ EC tree browser

§ Taxonomy tree browser

§ Ontologies for different biological domains (e.g., BRENDA tissue ontology, Gene Ontology)

§ Thesaurus for ligand names

§ Chemical substructure search engine for ligand structures

§ SOAP interface

Availability

The use of BRENDA is free of charge. In addition, FRENDA and AMENDA are free for non-profit users. Commercial users need a license for these databases.

Other databases

BRENDA provides links to several other databases with a different focus on the enzyme, e.g., metabolic function or enzyme structure. Other links lead to ontological information on the corresponding gene of the enzyme in question. Links to the literature are established with PubMed. BRENDA links to some further databases and repositories such as:

§ BRENDA tissue ontology

§ ExPASy

§ NCBI databases (Protein, nucleotide, structure, genome, OMIM, Domains)

§ IUBMB enzyme nomenclature

§ KEGG

§ PDB database (3D information)

§ PROSITE

§ SCOP

§ CATH

§ InterPro

§ ChEBI

§ UniProt

New BRENDA release online since December 2011

Category	Data fields
Nomenclature	Enzyme Names; EC Number; Common/Recommended Name; Systematic Name; Synonyms; CAS Registry Number
Reaction & Specificity	Pathway; Catalysed Reaction; Reaction Type; Natural Substrates and Products; Substrates and Products; Substrates; Natural Substrate; Products; Natural Product; Inhibitors; Cofactors; Metals/Ions; Activating Compounds; Ligands; Biochemicals Reactions Aligned
Functional Parameters	Km Value; kcat/Km Value; Ki Value; IC50 Value; pI Value; Turnover Number; Specific Activity; pH Optimum; pH Range; Temperature Optimum; Temperature Range
Isolation & Preparation	Purification; Cloned; Expression; Renatured; Crystallization
Organism-related information	Organism; Source Tissue; Localization; Protein-Specific Search
Stability	pH Stability; Temperature Stability; General Stability; Organic Solvent Stability; Oxidation Stability; Storage Stability
Enzyme Structure	Sequence/SwissProt link; 3D-Structure/PDB link; Molecular Weight; Subunits; Posttranslational Modification
Disease & References	Disease/Diagnostics; References
Application & Engineering	Engineering; Application

ERGO

Since 1995, over 200 microbial organisms have been completely sequenced (1). Since the first sequence, it has become evident that the single most important tool for interpreting a new genome sequence is a thorough analysis and its integration for comparative genomics. The success of the comparative analysis is directly dependent on the efficiency of integration, which in turn is determined by the diversity of the organisms, high quality annotations, and the level of detailed cellular reconstructions.

Integrated Genomics, Inc. has designed the ERGO™ bioinformatics suite to accommodate such data integration and to provide the tools necessary to support the comparative analysis of genomes and the generation of sophisticated metabolic and cellular reconstructions (2). Emerging from PUMA and WIT, which were previously developed at Argonne National Laboratory (3,4), ERGO™ is a third generation bioinformatics suite offered exclusively by Integrated Genomics at: http://ergo.integratedgenomics.com/ERGO/.

The ERGO system represents the development of a genome analysis strategy into a multi-dimensional environment, which supports both automatic and manual genome-wide curation. Rather than just repackaging known information, ERGO integrates genomic information with biochemical data, literature, and high-throughput analysis into a comprehensive user-friendly network of metabolic and non-metabolic pathways. In contrast to conventional systems, the ERGO user can take into account sequence similarity, protein and gene context clustering, occurrence profiles, regulatory and expression data, as well as functional hierarchies in order to achieve a set of the best possible functional predictions. In fact, using the ERGO system, a major part of the metabolism of an organism can be reconstructed entirely in silico (5). The cyclical nature of the integration of these information types continually elevates our knowledge and understanding of the complex dynamics residing in living organisms.

The current version of the ERGO™ database contains 618 complete or nearly complete genomes, of which 319 are Bacteria, 116 Eukarya, 34 Archaea and 149 Viruses (Figure 1). In total, these genomes contain over 1,300,000 Open Reading Frames (ORFs), more than 60% of which have a functional annotation. This percentage of annotated genes is higher for the bacterial genomes, reaching an average of 70%. Every genome that goes into the ERGO system is annotated from scratch, whether it has been sequenced at Integrated Genomics or at another sequencing center. More than 450 of the genomes are available for subscription or as part of a stand-alone ERGO server package from Integrated Genomics.

Figure 1. Phylogenetic distribution of the number of complete (red bars) and gapped (white bars) genomes integrated in the ERGO™ bioinformatics suite.

The ERGO system integrates many different types of data, summarized in Table 1. These include genomic and pathway related data (both metabolic and non-metabolic pathways), regulatory data, as well as 'proteomics' data such as gene essentiality and expression data. The genomic data include genome contigs, locations of ORFs and their translations, locations of RNAs, locations of insertion elements, functional assignments (along with their history records) and a number of proprietary gene clustering tools. The primary tools involve clustering of the ORFs according to sequence similarity (i.e. ortholog, paralog and protein clusters) or gene context (i.e. chromosomal and fusion clusters). The ortholog clusters are essentially bi-directional best hits across different genomes, while paralog clusters are homologs within the same genome. Protein family clustering represents a new clustering technology being developed at IG. It is based on the highly manually curated ORF database of ERGO and is an attempt to produce protein families where all ORFs share strong sequence homology and have the same predicted function. More than 60% of the ORFs in ERGO are currently connected to these sets of clusters. Protein clustering information is only available through subscription or purchase of a stand-alone ERGO server. The principle of chromosomal and fusion clustering, and its importance in function prediction, have been previously reported (6,7).
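The idea of paralog clusters as "homologs within the same genome" can be sketched as single-linkage grouping over similarity hits. This is only an illustration under stated assumptions: ERGO's clustering tools are proprietary, so the function name `paralog_families`, the input layout and the `min_score` cutoff are all hypothetical.

```python
from collections import defaultdict


def paralog_families(hits, genome_of, min_score=100.0):
    """Group genes of the same genome that are linked by similarity hits.

    hits      -- iterable of (gene_a, gene_b, score) similarity hits
    genome_of -- dict mapping gene id -> genome id
    min_score -- hypothetical cutoff below which a hit is ignored
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in hits:
        if a == b or score < min_score:
            continue
        if genome_of[a] != genome_of[b]:
            continue  # cross-genome hits are ortholog candidates, not paralogs
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = defaultdict(set)
    for gene in list(parent):
        families[find(gene)].add(gene)
    # Report only genuine families, i.e. groups with at least two members.
    return sorted(sorted(members) for members in families.values() if len(members) > 1)
```

Single linkage is the simplest choice and tends to merge families that share one domain; a production system would likely apply stricter criteria such as alignment coverage.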

Table 1. Summary of Data types in ERGO

Genomic Data

DNA sequence data into contigs (from over 400 genomes)
ORFs and their Location (Graphical visualization of ORFs on a contig)
Translation of ORFs
Pre-computed sequence similarities for each ORF (against the entire database)
Functional assignments of proteins (with their history records)
RNA assignments
Identification and localization of insertion elements (ISs)
Ortholog clusters
Paralog clusters
Protein family clusters
Chromosomal clusters
Fusion clusters

Pathway Data

Chemical structures
Enzyme records
Metabolic pathways
Non-metabolic pathways
Cellular overviews (networks of metabolic and non-metabolic pathways)
Functional hierarchies (Functional roles organized into Gene Ontologies)

Regulatory Data

Essentiality Data

Expression Data

Loading a new genome in ERGO

In order to incorporate a genome into ERGO™, all the potential ORFs must first be identified. There are numerous tools now available to support identification of genes, and better tools are constantly being developed. The reason for the rapid advances is clear: many of the decisions required in gene identification are based on sequence similarity to previously identified genes, and the set of well-annotated genomes is growing rapidly. As the pool of characterized genes grows, algorithms that exploit this information will produce more consistent and accurate predictions. Gene identification within prokaryotic genomes is substantially more reliable than within eukaryotic genomes; the absence of introns and the fact that the sequence is often of higher quality are undoubtedly major factors. The remaining problems relate to choice of start positions, detection of frameshifts, and identification of short genes. These problems can reasonably be viewed as significant, but relatively minor. Integrated Genomics has developed proprietary software to address and overcome these problems, which currently works quite efficiently.

The DNA sequence and its putative coding regions are then loaded into ERGO™ as a package. The installation follows a standard protocol which includes (i) de novo calculation of sequence similarities for all the newly predicted ORFs against the entire non-redundant set of ORFs (over 2 million) in ERGO using the FASTA algorithm and (ii) re-calculation of clusters based on sequence (i.e. orthologs, paralogs, protein clusters), or gene context (i.e. chromosomal 'neighborhoods' and fusion events) for the entire database. The installation is followed by a multi-step annotation routine established at IG and applied for the annotation of more than 200 genomes. The process culminates with building of metabolic reconstruction models, which are represented by wire diagrams of subsystems and cellular pathways connected to gene sequences.

Genome wide Functional Annotation

Genome annotation consists of a series of automated and manual procedures carried out by a combination of sophisticated algorithms and an experienced team of professional annotators at Integrated Genomics. The main steps are shown in Figure 2. In general, there are three stages of functional assignment for a given genome: two stages before the metabolic reconstruction is completed, and one stage after.

Figure 2. Genome curation steps at the ERGO™ bioinformatics suite.

Initial Automated Functional Annotations

The automatic annotations in ERGO™ are the culmination of a multi-step process, parts of which are summarized in Table 2.

Table 2. First round of analysis: Automatic annotation steps

Automatic annotation steps
Identify RNA genes	This is a semi-automated procedure. Currently we identify only tRNAs and rRNAs.
Identify protein-encoding genes	See above
Estimate phylogenetic distances	We attempt to determine an approximate position in the phylogenetic tree, the organism's closest neighbors, and estimate the distance between the organism and others in the tree. This is used at several points within the analysis that follows.
Calculate protein scores	Compute similarities of all ORFs of the query genome against the ERGO non-redundant database
Compute bi-directional best hits (BBHs) between protein-encoding genes	A gene X in organism Gx is a bi-directional best hit of a gene Y in organism Gy iff X is the closest gene to Y in Gx and Y is the closest gene to X in Gy. These are used constantly to attempt to find corresponding genes in distinct genomes. There is a broad literature on the use and misuse of BBHs.
Compute pairs of close bi-directional best hits (PCBBHs).	These are used heavily in the computation of 'functional coupling based on chromosomal clusters'.
Compute pairs of close homologs (PCHs).	Same as above
Compute pinned orthologs.	Pinned orthologs are used in constructing the 'pinned regions' displays in ERGO.
Compute preserved operons.	Preserved operons are estimates of sets of related genes that are clustered into what appears to be 'an operon', but no real assertion is being made of co-regulation.
Compute paralog families.	These are important in classes of genes like transposases, transporters, regulation proteins, and 2-component signaling systems.
Compute protein families.	We consider a family to be a set of homologous proteins with the same non-hypothetical function (obviously, both the homology and identity of function are estimates).
Compute profiles (of both families and organisms).	Profiles are not widely used at this stage. ERGO provides tools to compute which families act as signatures for sets of organisms based on 'family profiles'.
Compute 'spreadsheets' between closely-related genomes.	This determines both core functionality, and sets of genes which are local to subsets of the genomes.
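The bi-directional best hit (BBH) definition in Table 2 translates directly into code. The sketch below assumes that "closest" means "highest similarity score" and that the scores are already available as two dictionaries, one per search direction; the function names and input layout are hypothetical and do not reflect ERGO's implementation.

```python
def best_hits(scores):
    """Return the best-scoring subject for every query gene.

    scores -- dict mapping (query_gene, subject_gene) -> similarity score
    Ties are resolved in favour of the pair encountered first.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(x_vs_y, y_vs_x):
    """Pairs (X, Y) where Y is X's best hit in Gy and X is Y's best hit in Gx."""
    forward = best_hits(x_vs_y)
    reverse = best_hits(y_vs_x)
    return sorted((x, y) for x, y in forward.items() if reverse.get(y) == x)
```

A gene whose best hit points to a partner that prefers a different gene is dropped, which is why BBHs are a conservative estimate of corresponding genes between two genomes.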

Manual Functional Annotations

Manual annotations can lead to far more accurate and detailed predictions than any automatic tools. Although it has been argued that manual analysis cannot keep up with the extraordinarily large volumes of sequence data produced, Integrated Genomics has achieved this through a combination of a systematic approach and the integration of the data into a single system. The systematic approach includes a manual inspection of the automatically assigned functions, as well as an exhaustive manual study of every single gene, using both proprietary and publicly available tools (Table 3). Here, all questionable assignments and cases of weak homology are evaluated using sequence similarity search tools.

Table 3. Second round of analysis: Manual annotation steps

Manual annotation steps
Examine un-annotated ORFs with strong hits to ORFs with functions	At the end of the automatic annotation round, genes may remain without a function, although they do have a strong sequence similarity to other genes of known function. This happens when the program encounters cases of equally strong hits to different functions. Therefore, a more detailed analysis is needed here to distinguish between the alternative options.
Reconcile models	This step entails a manual inspection of all the differences between the automatic annotations in ERGO, and those generated by motif/domain databases like Pfam, COGs, InterPro, etc. This step increases both the coverage and the accuracy.
Examine gene context	Examine the physical layout of the genes on the chromosome looking for previously unrevealed functional relations between genes.
Examine Paralog families	Examine the ORFs that are part of paralog families, and remain un-annotated. Several such cases can be annotated with general family names, rather than assigning exact specificity.

Since functional annotations have been traditionally based on similarity to genes of known function, ERGO provides online access to sequence similarity tools such as BLASTP or PSI-BLAST searches that are submitted to the NCBI server. In addition to these, queries can be submitted to more sensitive sequence similarity search tools such as the motif/pattern databases Pfam, Prosite, Prodom, InterPro and COGs.

One of ERGO's most significant features is its comparative annotation environment, which provides quality checks for both the automatic annotations and manual analysis. To this end, a user may request to compare all different annotations available for the genes of a particular genome (Table 3). These annotations come either from other users of the ERGO system or from external databases (whose annotations have already been integrated into ERGO). Whenever possible, all the function predictions from SwissProt and TrEMBL, or PIR, are included for the genomes, as well as those based on Pfam and COGs. In addition to this,

Furthermore, as the number of genome sequences grows, we have incorporated additional methods that rely on gene context rather than on sequence similarity (7,11). Based on the tendency of functionally related bacterial genes to cluster along the chromosome, it is now possible to extend our ability to predict functions beyond sequence similarity (11). We calculated such 'chromosomal clusters' using a bi-directional best-hit algorithm for all the ortholog genes throughout all the genomes in the ERGO database. Since only 1/3 of genes are clustered this way in an average bacterial genome, a large number of genomes are needed for the method to work. With ERGO's content of more than 350 prokaryotic genomes, chromosomal clusters coupled with organism-specific functional pathways have become a powerful tool for predicting functions for 'missing' genes (and gene families) and genes with weak homology. One can suggest a functional role for an unknown ORF by cross-referencing the chain of biochemical reactions with an ORF cluster in any genome. A similar approach is that of gene fusion, which is based on the observation that often two or more ORFs that are separate ("components") in one organism have their orthologs fused as a single protein (a "composite") in another one (7). Such fusions sometimes yield a functional clue for unknown 'components': if one of the separate ORFs does not have a known function, perhaps it is related to its 'Siamese twin' domain with known function, and vice versa. Gene fusions are particularly important for eukaryotes, where up to 55% of all genes can be fused in a given genome (A. thaliana, C. elegans).
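The gene-fusion observation can be turned into a simple screen: a protein is a candidate "composite" when two different proteins from another genome align to separate, largely non-overlapping regions of it. The sketch below is an assumption-laden illustration, not the method of reference (7): the function name `fusion_candidates`, the hit layout and the `max_overlap` tolerance are hypothetical, and all hits are assumed to come from a search against a single other genome.

```python
from collections import defaultdict
from itertools import combinations


def fusion_candidates(hits, max_overlap=10):
    """Find composite proteins matched by two separate component proteins.

    hits        -- iterable of (query, subject, q_start, q_end), where
                   q_start..q_end is the aligned region on the query protein
    max_overlap -- hypothetical tolerance (in residues) for overlapping regions
    """
    regions_by_query = defaultdict(list)
    for query, subject, start, end in hits:
        regions_by_query[query].append((subject, start, end))

    found = set()
    for query, regions in regions_by_query.items():
        for (s1, a1, b1), (s2, a2, b2) in combinations(regions, 2):
            if s1 == s2:
                continue  # two alignments to the same subject are not a fusion
            # Shared residues between the two aligned regions (negative if disjoint).
            overlap = min(b1, b2) - max(a1, a2) + 1
            if overlap <= max_overlap:
                found.add((query, tuple(sorted((s1, s2)))))
    return sorted(found)
```

Each reported pair of components is a hypothesis only: if one component has a known function and the other does not, the known function offers a clue worth checking manually.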

Overall, the combined use of the above tools, along with detailed multi-step manual curation supported by the ERGO system, results in a significant improvement in function prediction.

Automated pathway Annotations

Once the function is predicted confidently, it may then be connected to a particular metabolic or cellular pathway which already exists in the ERGO™ pathway collection. The level of detail and coverage at this step is directly related to the number of pathways present in the ERGO system. Over the years, IG has been compiling a database of functional pathways dubbed IG-Pathdb. It now contains over 5,000 cellular pathways (the majority of which are metabolic) and new ones are being added daily. Each metabolic pathway entry stores information about metabolites, reactions, and corresponding enzymatic functions. The non-metabolic pathways, unlike the metabolic ones, represent either lists of functionally related genes (e.g. genes of the large ribosomal subunit, or genes of type IV protein secretion) or general lists of process related functions (e.g. general transcription activators, or phage proteins). Most of the pathways were extracted from the experimental literature and connected to specific gene sequences in the genomes in the ERGO database. With improvements in annotation technology, many pathways are now deduced from the sequenced genomes directly, using the metabolite compounds as connecting nodes and a set of rules. When a genome with annotated ORFs is added to ERGO, a set of pathways is automatically assigned to the organism based on a collection of pathway templates. During this automated step, only the pathways with all the functional roles connected to at least one gene are assigned (via the functions already assigned). Each function can be connected to a number of different or alternative pathways.
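The automated assignment rule, "only the pathways with all the functional roles connected to at least one gene", can be sketched as follows. The data layout (a dictionary of pathway templates and a dictionary of gene functions) and the function name `assert_pathways` are hypothetical; IG-Pathdb itself is not publicly specified in this text.

```python
def assert_pathways(templates, gene_functions):
    """Assign a pathway only if every one of its roles has at least one gene.

    templates      -- dict mapping pathway name -> set of functional roles
    gene_functions -- dict mapping gene id -> assigned functional role
    Returns a dict: pathway name -> {role: [genes carrying that role]}.
    """
    genes_by_role = {}
    for gene, role in gene_functions.items():
        genes_by_role.setdefault(role, set()).add(gene)

    asserted = {}
    for pathway, roles in templates.items():
        if roles and all(role in genes_by_role for role in roles):
            asserted[pathway] = {
                role: sorted(genes_by_role[role]) for role in sorted(roles)
            }
    return asserted
```

Because one role may appear in several templates, the same gene can support different or alternative pathways, which is what the later manual 'reality check' has to sort out.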

Manual Pathway Annotations

In the second round, an expert user can manually perform a 'reality check' on the set of asserted pathways (particularly the alternative ones), or assert additional ones, according to the literature data concerning the organism's 'life style', as well as its biochemistry and genetics. Cellular pathways are connected into larger functional subsystems, such as amino-acid metabolism, oxidative phosphorylation, lipid metabolism, secretion, etc. This is a partially automated task done at IG by professional curators specializing in particular subsystems. Based on their expert knowledge of a subsystem (with all the alternatives among hundreds of organisms in ERGO), curators first look for the asserted pathways from their functional subsystem. Then they determine the set of pathways which must be found in the organism under study because they are essential for the organism. These additional pathways will not have previously been asserted because of missing gene associations with one or more of the functional roles in the pathway. Once identified, these pathways can be used to find those missing functions that escaped the initial similarity based analyses.

Detailed Function Annotations

The detailed knowledge of every step in our collection of metabolic pathways allows us to identify the missing steps of the pathway for a particular organism. We then go back into a third round of manual annotations and try to predict these missing steps. This brings us to the third and final step of annotation, which entails a directed and reversed (as compared to the first two rounds) approach. In this highly laborious step, the query is the function predicted to be present, and the target is the gene, which is now expected to be identified, as opposed to the first two rounds, where the query was the gene that had been predicted to exist and the target was the function that remained unidentified. If most (or some) functions in a given pathway are connected to genes which are neighbors on the chromosome, then that may yield a functional clue: if one of the ORFs in this neighborhood is without an assigned function, then perhaps it has the function in the pathway that has no gene connected.
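The reversed search described above, starting from a missing function and looking for an unassigned ORF among chromosomal neighbors of the pathway's genes, might be sketched like this. Everything here is illustrative: the function name `candidate_orfs`, the `window` size and the `min_neighbors` cutoff are hypothetical choices, and genes are assumed to be listed in their order on a single contig.

```python
def candidate_orfs(gene_order, gene_functions, pathway_roles, window=3, min_neighbors=2):
    """Suggest unannotated ORFs that sit among genes of a given pathway.

    gene_order     -- gene ids in chromosomal order (one contig)
    gene_functions -- dict mapping gene id -> functional role (absent if unknown)
    pathway_roles  -- set of functional roles making up the pathway
    Returns (missing_roles, candidates); each candidate is
    (unannotated_gene, [neighboring genes that carry a pathway role]).
    """
    present = set(gene_functions.values()) & set(pathway_roles)
    missing = sorted(set(pathway_roles) - present)

    candidates = []
    for i, gene in enumerate(gene_order):
        if gene_functions.get(gene) is not None:
            continue  # only ORFs without an assigned function are candidates
        lo = max(0, i - window)
        hi = min(len(gene_order), i + window + 1)
        support = sorted(
            g for g in gene_order[lo:hi]
            if g != gene and gene_functions.get(g) in pathway_roles
        )
        if len(support) >= min_neighbors:
            candidates.append((gene, support))
    return missing, candidates
```

The output pairs each missing role with a short list of ORFs for a curator to examine; it does not decide which candidate, if any, actually carries the missing function.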

Table 4. Third round of analysis: Manual annotation steps

Focused manual annotation steps
Pathway assertions	Using a combination of automatic tools and manual analysis, all possible cellular pathways are asserted for an organism.
Identification of 'missing functions'	Identification of the functional roles in pathways that are asserted, for which no gene was identified
Verify that functions match with the functional roles of the pathways (Controlled vocabulary cleanup)	Check for instances in which functions were assigned, but the functions do not connect to existing pathways/subsystems (this often leads to the addition of more pathways or subsystems and more accurate annotations).
Examine gene context	Examine the physical layout of the genes on the chromosome looking for previously unrevealed functional relations between neighboring genes.
Search for possible unidentified genes	If all the above steps fail, examine the physical layout of genes on the chromosome looking for unacceptable overlaps between genes or unusually long gaps between genes.
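The last step of Table 4, looking for unacceptable overlaps or unusually long gaps between genes, is easy to automate as a first pass. The sketch below is a hypothetical illustration: the thresholds `max_overlap` and `max_gap` are arbitrary placeholders, coordinates are assumed to be 1-based and inclusive, and only neighboring ORFs (sorted by start position) are compared.

```python
def suspicious_intervals(orfs, max_overlap=60, max_gap=500):
    """Flag neighboring ORFs with a long gap or a large overlap.

    orfs -- list of (name, start, end) on one contig, 1-based inclusive
    Returns a list of (kind, left_orf, right_orf, length_in_bases),
    where kind is "gap" or "overlap".
    """
    ordered = sorted(orfs, key=lambda orf: (orf[1], orf[2]))
    flagged = []
    for (name1, _, end1), (name2, start2, _) in zip(ordered, ordered[1:]):
        gap = start2 - end1 - 1  # negative when the two ORFs overlap
        if gap > max_gap:
            flagged.append(("gap", name1, name2, gap))
        elif -gap > max_overlap:
            flagged.append(("overlap", name1, name2, -gap))
    return flagged
```

A long gap may hide a gene that the gene finder missed, while a large overlap may point to a wrong start position or a frameshift, so both kinds of region are worth a closer manual look.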

Thus, the combination of cellular pathways and gene context tools available in ERGO provides an ideal framework not only to identify and connect all possible functions to genes, but also to predict which functions should also be present and to facilitate the discovery of their corresponding genes (Figure 3).

Figure 3. Schematic representation of the process of function identification based on the combination of tools related to gene neighborhood and metabolic pathways in ERGO.