A Text Book On BIOINFORMATICS -BY ZAHOORULLAH S MD: UNIT IV PRIMARY DATABASE INFORMATION

Primary database Information

Introduction to Biological databases, organization and management of databases

The biological sciences encompass an enormous variety of information, from the environmental sciences, which give us a view of how species live and interact in a world filled with natural phenomena, to cell biology, which provides knowledge about the inner structure and function of the cell, and beyond. All this information requires classification, organization and management. Biological data exhibit many special characteristics that make management of biological information a particularly challenging problem. A multidisciplinary field called bioinformatics has emerged recently to address the management of genetic information, with special emphasis on DNA and protein sequence analysis. However, bioinformatics also harnesses all other types of biological information and covers the modeling, storage, retrieval, and management of that information. Moreover, applications of bioinformatics span new drug target validation and development of novel drugs, the study of mutations and related diseases, anthropological investigations of the migration patterns of tribes, and therapeutic treatments.

Specific features of biological data

1. Biological data is highly complex when compared with most other domains or applications. Definitions of such data must therefore be able to represent a complex substructure of data as well as relationships, and must ensure that information is not lost during biological data modeling. Biological information systems must be able to represent any level of complexity in any data schema, relationship, or schema substructure. A good example of such a system is the MITOMAP database documenting the human mitochondrial genome (http://www.mitomap.org). The database includes data, and the relationships among them, on ca. 17,000 nucleotide bases of the mitochondrial DNA; 52 gene loci encoding mRNAs, rRNAs and tRNAs; over 1,500 known population variants; and over 60 disease associations. MITOMAP also includes links to over 3,000 literature references. Traditional RDBMS or ODBMS are unable to capture all aspects of such data.

2. The amount and range of variability in biological data is high. Therefore, systems handling biological data should be flexible in data types and values. Constraints on data types and values must be applied with care, since unexpected values (e.g. outliers), which are not uncommon in biological data, could otherwise be excluded, resulting in a loss of information.

3. Schemas in biological databases change rapidly. This requires improved information flow between various database releases, as well as support for schema evolution and data object migration. Most relational and object database systems do not support extending the schema. As a result, many biological/bioinformatics databases (GenBank, for example) release the entire database with new schemas once or twice a year rather than changing the system incrementally as each change is needed.

4. Representations of the same data by different biologists will likely be different (even when using the same system). Thus, mechanisms are needed that can align different biological schemas.

5. Most users of biological data need only read access; write access to the database is not required. Usually the curators of the databases are the ones who need write privileges. The vast majority of users generate a wide variety of read-access patterns into the database, but these patterns are not the same as those seen in traditional relational databases. User-requested searches demand indexing of often unexpected combinations of data classes.

6. Most biologists have no knowledge of the internal structure of the database or of its schema design. They usually know what data they require, but they have no technical knowledge of the data structure or of how a DBMS represents the data. Biological database interfaces should therefore display information in a manner that is applicable to the problem users are trying to address and that reflects the underlying data structure in an easily understandable way. Relational database schemas fail to give the user intuitive information about the meaning of the schema. Web interfaces, on the other hand, often provide preset search forms, which may limit access into the database.

7. The context of data gives added meaning for its use in biological applications. Therefore it is important that the context is maintained and conveyed to the user when appropriate. It is also advantageous to integrate as much context as possible to maximize the interpretation of the biological data. For instance, the sequence of DNA is not very useful without information describing its organization, function, etc.

8. Defining and representing complex queries is extremely important to the biologist. Hence, biological systems must support complex queries and provide tools for building such queries.

9. Users of biological information often require access to “old” values of the data, particularly when verifying previously reported results. Therefore, changes in values must be recorded in archives so that researchers can reconstruct previous work and reevaluate prior and current information.

All these specific characteristics of biological data point to the fact that traditional DBMSs do not fully satisfy the requirements placed on complex biological data.

Existing Biological Databases

It has been estimated that over 1,000 major public and commercial biological databases were available to the scientific community by the end of 2006. These biological databases usually contain genomic and/or proteomic data. Some databases are also used in taxonomy. As already mentioned, biological databases incorporate an enormous amount of various types of biological data including (but certainly not limited to) nucleotide sequences of genes, amino acid sequences of proteins, information about their function, structure and localization on the chromosome, clinical effects of mutations, protein-ligand and gene-ligand interactions, and similarities between biological sequences. By far the most important resource for biological databases is the special yearly January issue of the journal Nucleic Acids Research. This issue categorizes all the publicly available online databases related to bioinformatics.

The most important biological databases can be roughly classified into the following groups:

- Primary sequence databases (including the International Nucleotide Sequence Database (INSD), consisting of DDBJ [DNA Data Bank of Japan], EMBL Nucleotide DB [European Molecular Biology Laboratory] and GenBank [National Center for Biotechnology Information]). These databanks represent the current knowledge about the sequences of all organisms. They exchange the stored information and are the source for many other databases.

- Meta-databases (including MetaDB, containing links and descriptions for over 1200 biological databases, Entrez [National Center for Biotechnology Information], euGenes [Indiana University], GeneCards [Weizmann Institute], SOURCE [Stanford University], Harvester [EMBL Heidelberg] and others). A meta-database can be considered a database of databases, rather than a single integration project or technology. Meta-databases collect information from different sources and usually make it available in a new and more convenient form.

- Genome browsers (e.g. the Integrated Microbial Genomes system, the Ensembl Genome Browser [Sanger Institute and European Bioinformatics Institute] and many others). Genome browsers enable researchers to visualize and browse entire genomes of organisms with annotated data including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. The annotated data usually come from multiple diverse sources.

- Specialized databases (Human Genome Organization database, SHMPD [The Singapore Human Mutation and Polymorphism Database] and many other databases)

- Pathway databases (e.g. BioCyc Database, Reactome and others)

- Protein sequence databases (UniProt [UniProt Consortium: EBI, Expasy, PIR], Swiss-Prot Protein Knowledgebase [Swiss Institute of Bioinformatics] and many others)

- Protein structure databases (Protein Data Bank, CATH Protein Structure Classification, SCOP Structural Classification of Proteins, etc.)

- Microarray databases (ArrayExpress [EBI], SMD [Stanford University], etc.)

- Protein-protein interaction databases (BioGRID [Samuel Lunenfeld Research Institute], STRING [EMBL])

Database bioinformatics tools

While database systems provide facilities to manage large data volumes, many database systems only partially support the numeric computations required for statistical assessment of scientific data, and therefore require further development. This shortcoming limits the use of database systems by scientific users. Integrating numerical algebraic calculations into the database makes it possible to optimize entire computations automatically, so that the benefits of query optimization, algorithm selection and data independence become available to computations on scientific databases. This removes the barrier between the database system and the computation, allowing the database optimizer to manipulate a larger portion of the application.

Algebraic Optimization of Computations.

A pioneering work by Wolniewicz and Graefe extends the concept of the database query and shows, in a case study, how numeric computation over time series data can be implemented effectively in scientific databases. This frees the user from concerns about the ordering of computation steps, algorithm selection, use of indices and other physical properties such as data distribution. The authors developed a scientific optimizer using the “Volcano” optimizer generator, which could perform logical transformations and physical algorithm selection.

In the optimization of scientific computations, the identification of suitable transformation rules is of central importance. Once applicable transformation rules have been found and applied to generate equivalent logical expressions, the optimizer must find a set of physical algorithms that can implement or execute each expression. For instance, a join operator can be implemented as either a merge-based or a hash-based algorithm, while an interpolation can be implemented by any of a variety of curve-fitting algorithms. Other query optimizer issues include limiting the search space, detecting common subexpressions and improving cost estimation.
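The point that one logical operator can be executed by several physical algorithms can be illustrated with a small Python sketch. This is only an illustration, not code from the Wolniewicz and Graefe system: the function names `hash_join` and `merge_join`, the record layout and the sample data are all hypothetical, and the merge join assumes unique keys within each input to keep it short.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (key, value); hypothetical record layout


def hash_join(left: List[Row], right: List[Row]) -> List[Tuple[int, str, str]]:
    """Equi-join that builds a hash table on the left input and probes it."""
    table: Dict[int, List[str]] = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    out = []
    for key, rvalue in right:
        for lvalue in table.get(key, []):
            out.append((key, lvalue, rvalue))
    return out


def merge_join(left: List[Row], right: List[Row]) -> List[Tuple[int, str, str]]:
    """Equi-join that sorts both inputs and advances two cursors.

    Simplification (assumption): keys are unique within each input.
    """
    left_sorted = sorted(left)
    right_sorted = sorted(right)
    i = j = 0
    out = []
    while i < len(left_sorted) and j < len(right_sorted):
        lkey, lvalue = left_sorted[i]
        rkey, rvalue = right_sorted[j]
        if lkey == rkey:
            out.append((lkey, lvalue, rvalue))
            i += 1
            j += 1
        elif lkey < rkey:
            i += 1
        else:
            j += 1
    return out


genes = [(3, "geneC"), (1, "geneA"), (2, "geneB")]
proteins = [(2, "protB"), (3, "protC"), (4, "protD")]

hash_result = sorted(hash_join(genes, proteins))
merge_result = merge_join(genes, proteins)
```

Both functions return the same rows for the same inputs, so an optimizer is free to pick whichever is cheaper for the data at hand (for example, the merge join when both inputs are already sorted).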

Time series can be viewed as sets in which each record is assumed to be tagged with a time value. Start and stop time, the average and maximum difference between samples, and whether or not the time differences are constant between adjacent records are all important for some operations. Spectra are treated in a manner very similar to time series, but with a frequency attribute attached to each record rather than a time value.
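The time-series properties listed above can be computed from the time tags alone. The following Python sketch is a hypothetical helper (the names `SeriesInfo` and `describe_times` and the tolerance parameter are assumptions made for illustration); it expects the time values to be sorted already.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SeriesInfo:
    start: float
    stop: float
    mean_step: float
    max_step: float
    constant_step: bool


def describe_times(times: Sequence[float], tol: float = 1e-9) -> SeriesInfo:
    """Summarize the time tags of a series that is already sorted by time."""
    if len(times) < 2:
        raise ValueError("need at least two records")
    # Differences between adjacent records.
    steps = [b - a for a, b in zip(times, times[1:])]
    return SeriesInfo(
        start=times[0],
        stop=times[-1],
        mean_step=sum(steps) / len(steps),
        max_step=max(steps),
        # The sampling is regular if all steps agree within the tolerance.
        constant_step=max(steps) - min(steps) <= tol,
    )


regular = describe_times([0.0, 1.0, 2.0, 3.0])
irregular = describe_times([0.0, 1.0, 4.0])
```

An optimizer could consult such a summary to decide, for instance, whether an operation that requires evenly spaced samples must be preceded by an interpolation step.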

The operators supported within the system are divided into two groups, logical and physical. The user’s original computation is composed of logical operators, while the final execution plan is expressed to the database in terms of physical operators. The authors demonstrate that, besides standard relational operators such as “select”, “project” and “join”, other necessary operators can be included: a random sampling operator, a digital filtering procedure, recomputation of the value of a single record based upon an averaging function applied to nearby records, interpolation and extrapolation operators, an operator for merging two time series, and a spectral filter operator.

The authors also include simple math function applications such as “correlation”, “convolution” and “deconvolution” of spectra. Additionally, physical operators implemented as iterators are included, such as “pass a window over a sorted data set”, “fast Fourier transform” and others. It should be noted that some operators (e.g. the Fourier transform of time series and spectra) are expensive to perform and, therefore, the decision on when to move between normal and Fourier space is important for the optimization process. Logical transformations for scientific operators are also vital to the application of optimization. A number of transformations in specific scientific domains are considered valid and thus need to be implemented.

Moreover, the user should be able to enable or disable certain transformations to control the accuracy of the results of the equation. The authors conclude that it is very beneficial to remove the barrier between the database system and scientific computations over the database. Thus, automatic optimization of integrated algebras is a crucial step in supporting scientific computations over database systems.
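One of the physical operators named above, passing a window over a sorted data set to recompute each value from its neighbours, lends itself to an iterator. The sketch below is a generic moving average written as a Python generator; the function name `moving_average` and the choice of a trailing window are assumptions for illustration and are not taken from the authors' system.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(values: Iterable[float], width: int) -> Iterator[float]:
    """Yield the mean of each full window of `width` consecutive values."""
    if width < 1:
        raise ValueError("width must be positive")
    window: deque = deque(maxlen=width)
    total = 0.0
    for v in values:
        if len(window) == width:
            total -= window[0]  # this value is evicted by the append below
        window.append(v)
        total += v
        if len(window) == width:
            yield total / width


smoothed = list(moving_average([1.0, 2.0, 3.0, 4.0, 5.0], width=3))
```

Because the generator consumes its input one record at a time and keeps only `width` values in memory, it can be chained with other iterator-style operators without materializing intermediate results.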

Sequence Retrieval System (SRS)

The large amount and diversity of data available to the scientific community in the many different databases is to the user’s advantage, but it also creates the problem of knowing exactly where to find specific information. Another problem is that databases with different contents also use different formats, adding further technical difficulties to the already complex task of accessing and exploiting the data.

Even though many databases have developed linking systems to relate to data stored elsewhere, these links are difficult to use because of the differences between individual database implementations. With the expansion of databases, quick access to their contents has become an important issue. Different technical solutions are used or are under development at the major bioinformatics sites. The major contribution to this field was the development of the Sequence Retrieval System (SRS) at EBI. SRS has addressed many of the difficulties in database access and in the integration of databases and analysis tools. SRS is an integration system for both data retrieval and data analysis applications. It provides, under a uniform and simple-to-use interface, a high degree of linking between the databases in the system. This allows the user to perform simple and complex queries across different databases. Its original way of linking makes it possible to query across databases that do not even contain direct cross-references to each other.

The user is even able to ask questions like “Give me all the proteins that share InterPro domains with my protein” or “Give me all the known 3D structures of this set of proteins”. The SRS server at EBI (http://srs.ebi.ac.uk) contains more than 140 biological databases (including sequence and sequence-related databases, bibliography, metabolic pathways, 3D structure and many other databases) and integrates many analysis tools. Results of such analyses are themselves indexed as SRS databases and can thus be linked to others by using predefined or user-defined SRS views.
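Querying across databases that have no direct cross-reference can be pictured as following a chain of links through intermediate databases. The Python sketch below shows only that general idea with a breadth-first search; it is not a description of how SRS is implemented, and the link table `LINKS` and the function `link_path` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical cross-reference table: an edge A -> B means
# "entries in A carry references to entries in B".
LINKS: Dict[str, List[str]] = {
    "EMBL": ["UniProt"],
    "UniProt": ["EMBL", "InterPro", "PDB"],
    "InterPro": ["UniProt"],
    "PDB": ["UniProt"],
}


def link_path(source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of databases from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in LINKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of links connects the two databases


path = link_path("EMBL", "PDB")
```

In this toy table EMBL has no direct link to PDB, yet the two can still be related through UniProt, which is the kind of indirect query the text describes.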

Overall, the EBI and other major world bioinformatics centers provide a whole range of analysis tools for databases. There are currently more than 60 distinct services available at the EBI, such as a range of sequence homology and similarity algorithms (FASTA, BLAST and Smith-Waterman), sequence analysis tools (many European Molecular Biology Open Software Suite applications), and gene and structural prediction methods.

Searching and retrieval of information from world wide web.

http://www.jasonmorrison.net/iakm/cited/Gordon_Pathak.pdf

Structure databases - PDB (Protein Data Bank)

The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations (PDBe, PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide Protein Data Bank (wwPDB).

The PDB is a key resource in areas of structural biology, such as structural genomics. Most major scientific journals, and some funding agencies, such as the NIH in the USA, now require scientists to submit their structure data to the PDB. If the contents of the PDB are thought of as primary data, then there are hundreds of derived (i.e., secondary) databases that categorize the data differently. For example, both SCOP and CATH categorize structures according to type of structure and assumed evolutionary relations; GO categorizes structures based on genes.

Two forces converged to initiate the PDB: 1) a small but growing collection of sets of protein structure data determined by X-ray diffraction and 2) the newly available (1968) molecular graphics display, the Brookhaven RAster Display (BRAD), to visualize these protein structures in 3-D. In 1969, with the sponsorship of Dr. Walter Hamilton at the Brookhaven National Laboratory, Dr. Edgar Meyer (Texas A&M University) began to write software to store atomic coordinate files in a common format to make them available for geometric and graphical evaluation. By 1971, one of Dr. Meyer's programs, SEARCH, enabled researchers to remotely access information from the database to study protein structures offline. SEARCH was instrumental in enabling networking, thus marking the functional beginning of the PDB.

Upon Hamilton's death in 1973, Dr. Tom Koetzle took over direction of the PDB for the subsequent 20 years. In January 1994, Dr. Joel Sussman of Israel's Weizmann Institute of Science was appointed head of the PDB. In October 1998, the PDB was transferred to the Research Collaboratory for Structural Bioinformatics (RCSB); the transfer was completed in June 1999. The new director was Dr. Helen M. Berman of Rutgers University (one of the member institutions of the RCSB). In 2003, with the formation of the wwPDB, the PDB became an international organization. The founding members are PDBe (Europe), RCSB (USA), and PDBj (Japan). The BMRB joined in 2006. Each of the four members of wwPDB can act as a deposition, data processing and distribution center for PDB data. Data processing here means that wwPDB staff review and annotate each submitted entry. The data are then automatically checked for plausibility (the source code for this validation software has been made available to the public at no charge).

The PDB database is updated weekly (UTC+0 Wednesday). Likewise, the PDB Holdings List is also updated weekly. As of 1 May 2012, the breakdown of current holdings is as follows:

Experimental Method	Proteins	Nucleic Acids	Protein/Nucleic Acid complexes	Other	Total
X-ray diffraction	66550	1352	3306	2	71210
NMR	8208	979	186	7	9380
Electron microscopy	285	22	118	0	425
Hybrid	44	3	2	1	50
Other	140	4	5	13	162
Total:	75227	2360	3617	23	81227

60,610 structures in the PDB have a structure factor file.

6,687 structures have an NMR restraint file.

454 structures in the PDB have a chemical shifts file.

These data show that most structures are determined by X-ray diffraction, but about 12% of structures (9,380 of 81,227) are now determined by protein NMR. When using X-ray diffraction, approximations of the coordinates of the atoms of the protein are obtained, whereas NMR experiments yield estimates of the distances between pairs of atoms of the protein. In the latter case, therefore, the final conformation of the protein is obtained by solving a distance geometry problem. A few proteins are determined by cryo-electron microscopy.
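The share of each experimental method follows directly from the Total column of the holdings table. The short Python calculation below copies those totals and computes the percentages; the variable names are arbitrary.

```python
# Totals per experimental method, copied from the holdings table (1 May 2012).
totals = {
    "X-ray diffraction": 71210,
    "NMR": 9380,
    "Electron microscopy": 425,
    "Hybrid": 50,
    "Other": 162,
}

grand_total = sum(totals.values())
shares = {method: 100.0 * n / grand_total for method, n in totals.items()}

for method, pct in shares.items():
    print(f"{method:20s} {pct:5.1f}%")
```

With these figures X-ray diffraction accounts for roughly 88% of the holdings and NMR for roughly 12%.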

The significance of the structure factor files, mentioned above, is that, for PDB structures determined by X-ray diffraction that have a structure factor file, the electron density map may be viewed. The data of such structures are stored on the "electron density server", where the electron density maps can be viewed.

In the past, the number of structures in the PDB has grown at an approximately exponential rate. However, since 2007, the rate of accumulation of new proteins appears to have plateaued:

Year	# added
2007	7263
2008	7073
2009	7448
2010	7971
2011	8120
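The year-on-year change implied by the table can be checked with a few lines of Python. The figures are copied from the table above; the variable names are arbitrary.

```python
# Structures added per year, copied from the table above.
added = {2007: 7263, 2008: 7073, 2009: 7448, 2010: 7971, 2011: 8120}

years = sorted(added)
# Percentage change of each year relative to the preceding year.
changes = {
    year: 100.0 * (added[year] - added[prev]) / added[prev]
    for prev, year in zip(years, years[1:])
}

for year, pct in changes.items():
    print(f"{year}: {pct:+.1f}% vs. previous year")
```

The annual changes stay within single-digit percentages in both directions, which is consistent with a roughly flat rate of new depositions over this period rather than exponential growth.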

The file format initially used by the PDB was called the PDB file format. This original format was restricted by the width of computer punch cards to 80 characters per line. Around 1996, the "macromolecular Crystallographic Information File" format, mmCIF, started to be phased in. An XML version of this format, called PDBML, was described in 2005. The structure files can be downloaded in any of these three formats. In fact, individual files are easily downloaded into graphics packages using web addresses:

- For PDB format files, use, e.g., http://www.pdb.org/pdb/files/4hhb.pdb.gz or http://pdbe.org/download/4hhb

- For PDBML (XML) files, use, e.g., http://www.pdb.org/pdb/files/4hhb.xml.gz or http://pdbe.org/pdbml/4hhb

The "4hhb" is the PDB identifier. Each structure published in PDB receives a four-character alphanumeric identifier, its PDB ID. (This cannot be used as an identifier for biomolecules, because often several structures for the same molecule—in different environments or conformations—are contained in PDB with different PDB IDs.)
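The way a download address is formed from a PDB ID can be sketched in Python. The URL patterns below are copied from the examples in the text and may no longer resolve; the helper `download_url` and its format keys are hypothetical, and the check only enforces the "four alphanumeric characters" rule stated above.

```python
import re

# URL patterns copied from the examples in the text (they may be outdated).
PDB_URL = "http://www.pdb.org/pdb/files/{pdb_id}.pdb.gz"
PDBML_URL = "http://www.pdb.org/pdb/files/{pdb_id}.xml.gz"

_ID_PATTERN = re.compile(r"[0-9A-Za-z]{4}")


def download_url(pdb_id: str, fmt: str = "pdb") -> str:
    """Return the download address for a four-character PDB ID (hypothetical helper)."""
    if not _ID_PATTERN.fullmatch(pdb_id):
        raise ValueError(f"not a four-character alphanumeric PDB ID: {pdb_id!r}")
    template = {"pdb": PDB_URL, "pdbml": PDBML_URL}[fmt]
    # The example addresses use the identifier in lower case.
    return template.format(pdb_id=pdb_id.lower())


url = download_url("4HHB")
```

The function only builds the address; fetching and decompressing the file is left to whatever download tool or graphics package is in use.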

Viewing the Structure

The structure files may be viewed using one of several open source computer programs. Some other free, but not open source, programs include ICM-Browser, VMD, MDL Chime, PyMOL, UCSF Chimera, RasMol, Swiss-PDB Viewer, StarBiochem (a Java-based interactive molecular viewer with integrated search of the Protein Data Bank), Sirius, and VisProt3DS (a tool for protein visualization in 3D stereoscopic view in anaglyph and other modes). The RCSB PDB website contains an extensive list of both free and commercial molecule visualization programs and web browser plugins.

Molecular Modeling Database (MMDB)

(Refer the following site for complete information on MMDB)

http://www.ncbi.nlm.nih.gov/Structure/MMDB/docs/mmdb_help.html#WhatIs

Primary databases NCBI, EMBL, DDBJ.

NCBI

The National Center for Biotechnology Information (NCBI) provides a comprehensive website for biologists that includes biology-related databases, and tools for viewing and analyzing the data inherent in the databases. A division of the National Library of Medicine at the National Institutes of Health, NCBI is the agency responsible for creating automated systems for storing and analyzing the rapidly growing profusion of genetic and molecular data. One of the most difficult challenges faced in the field of bioinformatics is how to store, in an easily accessible manner, the overwhelming abundance of new information, including the sequences of entire genomes, the ongoing discoveries of new genes and gene products, and the determinations of their functions and structures. NCBI was established as the government's response to the need for more and better information processing methods to deal with this challenge.

View the NCBI home page. A relatively good overview of the tools and databases that can be accessed through NCBI is provided in the list along the left border of the home page. Clicking on the link entitled "About NCBI" produces a second menu containing the topics "A Science Primer", and "Databases and Tools", among others. Selecting "A Science Primer" yields access to general definitions and introductory information regarding the branches of science included in bioinformatics. Many bioinformatics terms are defined in this section in a clear-cut and basic manner, making this Primer an excellent first resource. Selecting "Databases and Tools" from the "About NCBI" webpage menu yields a complete and well-ordered listing of accessible information. This web page containing the databases and tools menu is a good choice for those who are inclined toward bookmarking.

The first item under the "Databases and Tools" menu is "Literature Databases". PubMed is the most heavily used of the literature databases and can be used to access MEDLINE biological and medical scientific journal citations dating back to articles written in the mid-1960s. The second item under the "Databases and Tools" menu is "Entrez Databases". Entrez is a search and retrieval system developed by NCBI that can access integrated information by searching many of the NCBI databases with just one query (instead of searching only one database per query, then having to repeat the query to find information on the same topic in another NCBI database). The NCBI databases that are included in the search when you launch an Entrez query are shown when you click on this link. The "Nucleotide Databases" link under the "Databases and Tools" menu lists all the sequence databases available through NCBI. These sequence databases contain annotated collections of publicly available DNA, RNA and protein sequences. The evolution of bioinformatics data mining methods has been largely driven by the prodigious amount of sequence information collected by scientists in recent years. New sequences of unknown function can be compared with sequences of well-characterized genes and proteins. Similarities between the new, unknown sequences and the well-characterized sequences can be identified and used to form hypotheses about function or structure.

Among the tools listed under the NCBI "Databases and Tools" menu are "Tools for Data Mining". Selecting the "Tools for Data Mining" topic will show a list of data retrieval tools, including Entrez, mentioned above, and BLAST, the Basic Local Alignment Search Tool. BLAST is the predominant sequence alignment tool for performing rapid searches of nucleotide and protein sequence databases and detecting local sequence alignments between the query sequence and the database sequences.
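What a local alignment score means can be shown with the Smith-Waterman recurrence, the exact dynamic-programming method for local alignment. This is not the BLAST algorithm, which uses faster heuristics; it is a minimal Python sketch, and the scoring values (match +2, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A score never drops below zero: a local alignment may start anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


score = local_alignment_score("TTACGTAA", "GGACGTCC")
```

The two example sequences differ at both ends but share the stretch ACGT, and the local score reflects only that shared region; a global alignment would also be penalized for the mismatching flanks.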

This is a brief glimpse at some of the more widely used tools and databases offered by NCBI, presented with the intention of helping the novice get some feel for the number and types of bioinformatics tools that are available on the internet today. Several of these tools are covered in more detail in subsequent modules included in this bioinformatics course. Before proceeding to the next module, take a moment to return to the "About NCBI" webpage menu and glance through some of the interesting webpages linked under the topics "A Science Primer", "Outreach and Education", and "News".

Basic Research

NCBI has a multi-disciplinary research group composed of computer scientists, molecular biologists, mathematicians, biochemists, research physicians, and structural biologists concentrating on basic and applied research in computational molecular biology. These investigators not only make important contributions to basic science but also serve as a wellspring of new methods for applied research activities. Together they are studying fundamental biomedical problems at the molecular level using mathematical and computational methods. These problems include gene organization, sequence analysis, and structure prediction. A sampling of current research projects includes: detection and analysis of gene organization, repeating sequence patterns, protein domains and structural elements, creation of a gene map of the human genome, mathematical modeling of the kinetics of HIV infection, analysis of effects of sequencing errors for database searching, development of new algorithms for database searching and multiple sequence alignment, construction of non-redundant sequence databases, mathematical models for estimation of statistical significance of sequence similarity, and vector models for text retrieval. Additionally, NCBI investigators maintain ongoing collaborations with several institutes within the NIH and also with numerous academic and government research laboratories.

Databases and Software

NCBI assumed responsibility for the GenBank DNA sequence database in October 1992. NCBI staff with advanced training in molecular biology build the database from sequences submitted by individual laboratories and by data exchange with the international nucleotide sequence databases, the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ). Arrangements with the U.S. Patent and Trademark Office enable the incorporation of patented sequence data.

In addition to GenBank, NCBI supports and distributes a variety of databases for the medical and scientific communities. These include the Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) of 3D protein structures, the Unique Human Gene Sequence Collection (UniGene), a Gene Map of the Human Genome, the Taxonomy Browser, and the Cancer Genome Anatomy Project (CGAP), in collaboration with the National Cancer Institute.

Entrez is NCBI's search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data. Entrez also provides graphical views of sequences and chromosome maps. A powerful and unique feature of Entrez is the ability to retrieve related sequences, structures, and references. The journal literature is available through PubMed, a Web search interface that provides access to over 11 million journal citations in MEDLINE and contains links to full-text articles at participating publishers' Web sites.

BLAST is a program for sequence similarity searching developed at NCBI and is instrumental in identifying genes and genetic features. BLAST can execute sequence searches against the entire DNA database in less than 15 seconds. Additional software tools provided by NCBI include: Open Reading Frame Finder (ORF Finder), Electronic PCR, and the sequence submission tools, Sequin and BankIt. All of NCBI's databases and software tools are available from the WWW or by FTP. NCBI also has email servers that provide an alternative way to access the databases for text searching or sequence similarity searching.

EMBL

The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 20 European countries and by Australia as an associate member state. EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states. Research at EMBL is conducted by approximately 85 independent groups covering the spectrum of molecular biology. The Laboratory operates from five sites: the main Laboratory in Heidelberg, and Outstations in Hinxton (the European Bioinformatics Institute (EBI)), Grenoble, Hamburg, and Monterotondo near Rome.

Each of the sites has a specific research field. The EBI is a hub for bioinformatic research and services, developing and maintaining a large number of databases which are free of charge for the scientific community. At Grenoble and Hamburg, research is focused on structural biology. EMBL's dedicated Mouse Biology Unit is located in Monterotondo. At the headquarters in Heidelberg, there are units in Cell Biology and Biophysics, Developmental Biology, Genome Biology and Structural and Computational Biology as well as service groups complementing the aforementioned research fields.

Many scientific breakthroughs have been made at EMBL, most notably the first systematic genetic analysis of embryonic development in the fruit fly by Christiane Nüsslein-Volhard and Eric Wieschaus, for which they were awarded the Nobel Prize for Medicine in 1995.

The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), maintained at the European Bioinformatics Institute (EBI), incorporates, organizes and distributes nucleotide sequences from public sources. The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA). Data are exchanged between the collaborating databases on a daily basis to achieve optimal synchrony. The web-based tool, Webin, is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data. Automatic submission procedures are used for submission of data from large-scale genome sequencing centres and from the European Patent Office. Database releases are produced quarterly. The latest data collection can be accessed via FTP, email and WWW interfaces. The EBI’s Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases. For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database, the complete genomic component subsection of the database, the WGS data sets and other databases. All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.

INTRODUCTION

The mission of the Service Programme at the EBI is the building, maintenance and provision of biological databases and other information services to support data deposition and access by the scientific community. Databases provided at the EBI include the EMBL Nucleotide Sequence Database, the protein databases Swiss-Prot, TrEMBL and UniProt, InterPro, the Macromolecular Structure Database (E-MSD), the gene expression database ArrayExpress and the Ensembl automatic genome annotation database.

In Europe, most nucleotide sequence data and supporting bibliographical and biological data generated are collected and distributed by the EMBL Nucleotide Sequence Database. The EMBL database is a member of the International Nucleotide Sequence Database Collaboration DDBJ/EMBL/GenBank. The main sources of data in the EMBL database are large-scale genome sequencing projects, direct submissions from individual scientists and sequence data extracted from biotechnology patent applications to the European Patent Office. To achieve optimal worldwide synchrony, all new and updated database records are exchanged on a daily basis between EMBL, DDBJ (8) and GenBank. Third Party Annotation (TPA) and CONstructed (CON) records are also exchanged daily, while Whole Genome Shotgun (WGS) data sets are exchanged when they become available or have been updated.

EMBL database releases, with accompanying release notes, are produced quarterly.

Within the last 12 months the database size has increased from 18.3 million entries comprising 23 Gb (Release 72, September 2002) to 27.2 million entries comprising over 33 Gb (Release 76, September 2003). The number of organisms represented in the database is now ∼150 000.

During the course of 2003, the EMBL Sequence Version Archive was launched, the WGS data collection and distribution procedure was further developed and the data collection rules for the TPA data set continued to be revised. A detailed and up-to-date description of EMBL Nucleotide Sequence Database activities can be found at http://www.ebi.ac.uk/embl/.

SUBMISSIONS TO THE EMBL NUCLEOTIDE SEQUENCE DATABASE

A repository of primary nucleotide sequences is an essential requirement for computational analysis and genome research. Furthermore, molecular biologists depend on free access to such a repository. Many journals require authors to submit sequence information to the EMBL, GenBank or DDBJ database prior to publication in order to ensure its availability to scientists.

An introduction to database submission procedures is given below. For comprehensive details of procedures, please see http://www.ebi.ac.uk/embl/Submission/.

Webin

Webin is the preferred submission system for nucleotide sequence and biological annotation. Webin has been designed to allow rapid submission of single, multiple or very large numbers of sequences (bulk submissions) and is available at http://www.ebi.ac.uk/embl/Submission/webin.html. Webin has been modified to accept TPA submissions.

Genome project submissions

Database entries produced at the sequencing site can be deposited and updated directly by the submitters using FTP or email. Groups producing large volumes of genome sequence data over an extended period of time are advised to contact the database at datasubs@ebi.ac.uk.

Alignment submissions

EMBL-Align (10) is a public data set of both protein and nucleotide multiple sequence alignments and can be queried from the EBI Sequence Retrieval System (SRS) server. It was developed in response to the need for permanent electronic storage and standardized presentation of alignment data from phylogenetic and population analyses. Webin-Align is a dedicated web-based tool for submission of multiple sequence alignments in all common alignment formats. Webin-Align is available at http://www.ebi.ac.uk/embl/Submission/align_top.html.

ACCESSING NUCLEOTIDE SEQUENCE DATA AND RELATED DATA AT THE EBI

The EMBL Nucleotide Sequence Database is available from the EBI network services, interactively via the WWW or by email, using netserv (netserv@ebi.ac.uk). EMBL data sets are freely available from the EBI FTP server at ftp://ftp.ebi.ac.uk/pub/databases/embl/. For more information see http://www.ebi.ac.uk/embl/Access/.

Completed genome sequences and proteome analysis

Direct access to several thousand completely sequenced genomic components is available via the EBI Genomes server at http://www.ebi.ac.uk/genomes/. Proteome analysis information on all completely sequenced organisms is available at http://www.ebi.ac.uk/proteome/.

Whole Genome Shotgun (WGS) data

WGS data are available at ftp://ftp.ebi.ac.uk/pub/databases/embl/wgs/. At the time of writing (September 2003), data sets from 70 separate WGS projects are available. The largest data set is that of Rattus norvegicus. While many of the WGS data sets are not annotated, some biological features are present in some of the sets. The WGS data set for Anopheles gambiae strain PEST is an example with annotation. WGS data sets can now be searched using the FASTA algorithm (see below).

Sequence Retrieval System (SRS)

The EMBL Nucleotide Sequence Database can be accessed via the EBI SRS server (12,13) at http://srs.ebi.ac.uk/. In SRS, the data are available in the following libraries:

(i) EMBL: the database in its entirety by means of a virtual library comprising EMBLRELEASE, EMBLNEW, EMBLTPA and EMBLWGS;

(ii) EMBLRELEASE: library containing the latest official release of the EMBL Nucleotide Sequence Database;

(iii) EMBLNEW: library containing updated and new entries created since the last official release;

(iv) EMBLTPA: library containing TPA entries;

(v) EMBLWGS: library containing WGS entries;

(vi) EMBLCON: library containing CON entries.

Sequence searching

A comprehensive set of sequence analysis and database search algorithms is available at http://www.ebi.ac.uk/Tools/. Sequence similarity searches are available interactively over the WWW as well as by email. Users can search the EMBL Nucleotide Sequence Database as a whole or by individual taxonomic division.

The most commonly used algorithms available are FASTA and WU-BLAST, permitting comparisons between nucleotide query sequences and the nucleotide or protein databases as well as searches of protein query sequences against the nucleotide database.

The FASTA service for genomes and proteomes (http://www.ebi.ac.uk/fasta33/genomes.html) enables users to search interactively completed genomes and proteomes. The same searches can be performed by email (gpfasta@ebi.ac.uk). User instructions are available by sending an email with the word HELP in the body of the message to gpfasta@ebi.ac.uk. WGS data sets are now available for searching.

Sequence analysis

Sequence analysis programs offered include multiple sequence alignment and inference of phylogenies using ClustalW, protein classification using InterProScan and others. The EBI also provides interactive sequence analysis resources based on the European Molecular Biology Open Software Suite (EMBOSS) (http://www.emboss.org/).

DEVELOPMENTS

Sequence length limits

Currently, database records are limited in length to 350 000 bp. At the DDBJ/EMBL/GenBank collaborative meeting of May 2003, a decision was taken to remove the size restriction on database records in June 2004.

This development will allow the entire sequence derived from a naturally occurring biological unit to be stored as a single database entry, thus eliminating the need to split long sequences into segments and create CON entries to store the assembly information (19). Currently, ∼3% of all base pairs in the database are stored in the constituent segment entries of CON entries.
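As a rough illustration of why the length limit forces long sequences to be segmented, the following Python sketch computes the base spans that a sequence of a given length would occupy when cut into records of at most 350 000 bp. The function name and the choice of consecutive, non-overlapping segments are assumptions made for the example; the text does not describe how submitters actually chose segment boundaries.

```python
# Illustrative only: cut a sequence of total_length bp into consecutive,
# non-overlapping spans that each respect the record length limit.
MAX_RECORD_LENGTH = 350_000  # limit quoted in the text (bp)


def split_into_segments(total_length, max_length=MAX_RECORD_LENGTH):
    """Return 1-based inclusive (start, end) spans covering the sequence."""
    if total_length < 0 or max_length <= 0:
        raise ValueError("lengths must be non-negative and the limit positive")
    return [
        (start, min(start + max_length - 1, total_length))
        for start in range(1, total_length + 1, max_length)
    ]


if __name__ == "__main__":
    # A hypothetical 1 000 000 bp sequence needs three segment entries.
    for start, end in split_into_segments(1_000_000):
        print(f"{start}-{end}")
```

Under the current limit each of these spans would be a separate database entry, with a CON entry recording how they join; once the limit is removed the whole sequence can be a single entry.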

Third Party Annotation (TPA) data set

Until recently, the collaborative databases have collected and distributed only primary nucleotide sequence and annotation data resulting from direct sequencing of such molecules as cDNAs, ESTs and genomic DNA. ‘Primary data’ is defined as annotated sequence that has been determined by submitters and their teams. Primary database entries remain in the ownership of the original submitter and the co-authors of the submission publication(s). The owners of database entries have privileges to implement updates to the data.

In response to demand from the research community, the collaborative databases have created the TPA data set. The types of data that make up the TPA data set include reannotations of existing entries, combinations of novel sequence and existing primary entries and annotation of trace archive and WGS data.

TPA data are submitted using Webin. Submitters are required to provide DDBJ/EMBL/GenBank accession and version numbers and nucleotide locations for all primary entries to which their TPA entry relates. For TPA sequences composed from trace archive data, the identifier (e.g. TI123445566) and corresponding nucleotide locations must be provided.

TPA entries can be distinguished easily from their primary counterparts. The abbreviation ‘TPA:’ appears at the beginning of each description (DE) line and the keywords ‘Third Party Annotation’ and ‘TPA’ appear in the keyword (KW) line.

AH   TPA_SPAN    PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
AS   1–251       BE529226.1           1–251
AS   68–450      BE524624.1           1–383
AS   394–1086    AJ420881.1           1–693
AS   826–1211    AV561543.1           1–386

The flat-file extract shown above (from BN000024) shows the two new line types that have been created for TPA entries. The Assembly Header (AH) line provides column headings for the assembly information. The Assembly (AS) lines provide information on the composition of the TPA sequence by listing base span(s) of the TPA sequence together with identifiers and base span(s) of contributing sequences.
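The conventions just described (the ‘TPA:’ prefix on the DE line, the two keywords on the KW line, and the AS assembly lines) lend themselves to simple programmatic checks. The Python sketch below is a minimal illustration, not an official parser: the function and class names are hypothetical, the code assumes each flat-file line starts with its two-letter line code followed by spaces, it assumes keywords on the KW line are separated by semicolons, and it treats any value in the COMP column as a flag for the complementary strand.

```python
import re
from typing import NamedTuple


class AssemblyLine(NamedTuple):
    tpa_start: int
    tpa_end: int
    primary_id: str
    primary_start: int
    primary_end: int
    complement: bool  # assumption: any value in the COMP column sets this


# Spans may be typeset with a hyphen or an en dash (\u2013).
_AS_PATTERN = re.compile(
    r"^AS\s+(\d+)[-\u2013](\d+)\s+(\S+)\s+(\d+)[-\u2013](\d+)(?:\s+(\S+))?\s*$"
)


def parse_as_line(line):
    """Parse one Assembly (AS) line into its spans and identifier."""
    match = _AS_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognisable AS line: {line!r}")
    tpa_start, tpa_end, primary_id, p_start, p_end, comp = match.groups()
    return AssemblyLine(int(tpa_start), int(tpa_end), primary_id,
                        int(p_start), int(p_end), comp is not None)


def is_tpa_entry(flat_file_lines):
    """Apply the DE-line and KW-line conventions described in the text."""
    de_text = [l[2:].strip() for l in flat_file_lines if l.startswith("DE")]
    kw_text = " ".join(l[2:].strip() for l in flat_file_lines
                       if l.startswith("KW"))
    keywords = {k.strip().rstrip(".") for k in kw_text.split(";")}
    return (bool(de_text)
            and de_text[0].startswith("TPA:")
            and {"Third Party Annotation", "TPA"} <= keywords)


if __name__ == "__main__":
    span = parse_as_line("AS 68-450 BE524624.1 1-383")
    print(span.primary_id, span.tpa_start, span.tpa_end)
```

Each parsed AS line states which bases of the TPA sequence are covered by which span of a contributing primary entry.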

In order to ensure sequence annotation of the highest quality, entries that are yet to be discussed in peer-reviewed publications are held confidential and are not visible to database users. This is an important difference from our policy of data release for primary entries.

At the time of writing (September 2003), 457 TPA entries are publicly available, of which 150 entries are of human origin. The second most common source organism for this type of entry is mouse, with 95 entries, showing that, so far, the TPA data set follows the same pattern as the primary dataset. Statistics for all EMBL Nucleotide Sequence Database data, including top 10 organisms by base count, can be found at http://www3.ebi.ac.uk/Services/DBStats/. Further information on the TPA dataset can be found at http://www.ebi.ac.uk/embl/Documentation/third_party_annotation_dataset.html and instructions on data submission can be found at http://www.ebi.ac.uk/webin/webin_help.html.

EMBL Sequence Version Archive (SVA)

The EMBL SVA (20) was created to provide access to all versions of EMBL Nucleotide Sequence Database entries, including CON, TPA and WGS data. There were 145 million entry versions in the archive by September 2003, and new versions are being added every day. Entries from all past EMBL Nucleotide Sequence Database releases, starting with the first release in 1982, have been loaded into the archive.

Each time an EMBL database entry is created or modified it is loaded into the archive, where it can be accessed and compared with other versions of the same entry. If an entry is updated, corrected or extended as a result of new findings from recent experiments, the entry version is incremented. Changes in the taxonomic lineage, or flat-file formatting changes are not reflected in the entry version. For this reason, the archive may contain several variants of an entry with the same entry version number.

The archive can be accessed interactively at http://www.ebi.ac.uk/embl/sva/ and programmatically at http://www.ebi.ac.uk/cgi-bin/dbfetch.

Entries can be retrieved interactively using accession numbers, protein identifiers and sequence versions. The user chooses to view either the complete chronological history of an entry or the entry version that was current at a specified date. The resulting entry versions can be viewed, downloaded and compared. The interactive interface can also be reached by following hyperlinks from the EBI SRS query results page when working with EMBL Nucleotide Sequence Database and EMBL-Align entries. As an example of programmatic entry retrieval, the following URL returns the latest EMBL entry having the accession number AC067752: http://www.ebi.ac.uk/cgi-bin/dbfetch?db=SVA&id=AC067752&format=default.
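The example URL above follows a simple pattern of three query parameters (db, id and format). A minimal Python sketch of how such a URL could be assembled is given below; the function name is hypothetical, only the parameter values shown in the example are taken from the text, and the address dates from 2003, so it should not be assumed to resolve today. The sketch only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Base address as quoted in the text (2003); it may no longer be live.
DBFETCH_BASE = "http://www.ebi.ac.uk/cgi-bin/dbfetch"


def build_dbfetch_url(accession, db="SVA", fmt="default"):
    """Assemble a dbfetch-style URL for one accession number."""
    query = urlencode({"db": db, "id": accession, "format": fmt})
    return f"{DBFETCH_BASE}?{query}"


if __name__ == "__main__":
    print(build_dbfetch_url("AC067752"))
```

Using urlencode rather than string concatenation ensures that any special characters in an identifier are escaped correctly.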

XML format for data exchange

The EMBL Nucleotide Sequence Database has initiated efforts to produce an XML format for the distribution of entries. The development of this format will be carried out in collaboration with DDBJ and GenBank with the aim of developing a common representation for the distribution of data.

DDBJ

The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration (INSDC). It exchanges its data on a daily basis with the European Molecular Biology Laboratory database at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information. As a result, the three databanks contain virtually the same data at any given time.

DDBJ began data bank activities in 1986 at NIG and remains the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it accepts data from contributors in any other country. DDBJ is primarily funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT). DDBJ has an international advisory committee of nine members, three each from Europe, the US and Japan. This committee advises DDBJ once a year on its maintenance, management and future plans. In addition, DDBJ has an international collaborative committee, made up of working-level participants, which advises on technical issues related to international collaboration.

DDBJ is the sole nucleotide sequence data bank in Asia that is officially certified to collect nucleotide sequences from researchers and to issue internationally recognized accession numbers to data submitters. Because the collected data are exchanged daily with EMBL-Bank at the European Bioinformatics Institute (EBI) and GenBank at the National Center for Biotechnology Information (NCBI), the three data banks share virtually the same data at any given time. This virtually unified database is called the International Nucleotide Sequence Database (INSD). DDBJ collects sequence data mainly from Japanese researchers, but it also accepts data from, and issues accession numbers to, researchers in any other country.

DDBJ is operated by the Center for Information Biology and DNA Data Bank of Japan (CIB-DDBJ) of the National Institute of Genetics (NIG), with the endorsement of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT). Of the INSD data from Japanese researchers, 99% are submitted through DDBJ.

The principal purpose of DDBJ operations is to improve the quality of INSD as a public-domain resource. When researchers release their data to the public through INSD so that they are shared worldwide, DDBJ makes every effort to describe the data as richly as possible, according to the unified rules of INSD, preferably without any stress by using DDBJ

Explanation of DDBJ flat file format

The database is a collection of entries; the entry is the basic unit of data. Each entry submitted to DDBJ is processed and released in the DDBJ distribution format (flat file). The flat file includes the sequence together with information on submitters, references and source organisms, as well as "feature" information. Features are defined by the DDBJ/EMBL/GenBank Feature Table Definition and describe the biological nature of the nucleotide sequence, such as gene function and other properties.

Database Search

getentry

Data retrieval by accession numbers, etc.

ARSA

All-round Retrieval of Sequence and Annotation

TXSearch

Retrieval of unified taxonomy database

BLAST

Homology Search

DDBJ Vector Screening System

Phylogenetics

ClustalW

Multiple alignment and Tree-making

Submission of Gene Expression Data

CIBEX ※suspended

Gene expression database; data submission and search following MIAME, the international guideline

Genome Analyses

GIB ※suspended

Genome information broker

GIB-V ※suspended

GIB for Viruses

MiGAP

Mechanical annotation tool for microbial genomes
(Login ID is required.)

GTPS

Reannotation of bacterial genomes using a new common protocol

GTOP

Genome to protein structure and function

Next Generation Sequence Analysis

DDBJ Read Annotation Pipeline ※suspended

High-throughput data analysis of next generation sequence data (Login ID is required)

Protein Database and Structure

PMD

Protein mutant database

Software developed at CIB-DDBJ

WINA

A Window Analysis Program for the Number of Synonymous and Nonsynonymous Nucleotide

DendroMaker for Macintosh

Software package for drawing dendrograms