
Thursday, May 8, 2008

Biological Databases

Biological and Protein Databases

The sequencing of the human genome was completed in April 2003, and it has had a profound effect on how biological and biomedical research is conducted. It has given us human sequence variation data, model organism sequence data, and information on gene structure and function, all of which give researchers a basis to better design and interpret their experiments, fulfilling the promise of bioinformatics in advancing and accelerating biological discovery.
GenBank is the database with which most researchers are familiar. It is the annotated collection of all publicly available DNA and protein sequences. This database, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health, represents a collaborative effort between NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ).
The Human Genome Project, along with other sequencing projects, has produced a vast amount of sequence data. For example, the number of bases in GenBank doubles every 14 months, and this exponential growth is expected to continue for some time to come.
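
As a rough illustration of what a 14-month doubling time implies, the sketch below computes how much a database would grow over a given period. It assumes the doubling time stays constant, and the function names are made up for this example.

```python
import math

def growth_factor(months, doubling_time_months=14.0):
    """Factor by which the database grows after `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_time_months)

def months_to_grow(factor, doubling_time_months=14.0):
    """Months needed to grow by `factor` at a fixed doubling time."""
    return doubling_time_months * math.log2(factor)

if __name__ == "__main__":
    print(growth_factor(28))      # two doubling periods
    print(months_to_grow(10.0))   # time for a tenfold increase
```

Under this assumption, a tenfold increase takes a little under four years.
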
GenBank, or any other biological database for that matter, serves little purpose unless the data can be easily searched and entries retrieved in a usable, meaningful format. Otherwise, sequencing efforts have no useful end, since the biological community as a whole cannot make use of the information hidden within these millions of bases and amino acids.
Be that as it may, the range of publicly available biological data goes far beyond what is included in GenBank. Since the major public sequence databases need to store data in a generalized fashion, they often do not contain the more specialized types of information that would be of interest to specific segments of the biological community. To address this, many smaller, specialized databases have emerged. These databases, which contain information ranging from strain crosses to gene expression data, provide a valuable adjunct to the more visible public sequence databases, and users are encouraged to make intelligent use of both types of databases in their searches.
EMBL (Europe) (SRS and EMBL and EBI)
  • The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.
  • It can be accessed and searched through the SRS system at EBI, or one can download the entire database as flat files.
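
To give a feel for working with downloaded flat files, here is a minimal sketch of a parser for EMBL-style entries. It relies on the two-letter line codes of the format (`AC` for accession numbers, `DE` for the description, `SQ` for the start of the sequence, `//` for the end of an entry). The sample record, its accession number, and the function name `parse_embl` are invented for this example, and a real entry carries many more line types than are handled here.

```python
# Invented record for illustration only; not a real database entry.
SAMPLE = """\
ID   ZZ99999; SV 1; linear; genomic DNA; STD; UNC; 24 BP.
XX
AC   ZZ99999;
XX
DE   Invented example entry for illustration only.
XX
SQ   Sequence 24 BP; 6 A; 6 C; 6 G; 6 T; 0 other;
     acgtacgtac gtacgtacgt acgt                                               24
//
"""

def parse_embl(text):
    """Return one dict per entry with its accessions, description, and sequence."""
    records = []
    accessions, description, sequence = [], [], []
    in_sequence = False
    for line in text.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "//":
            # End of entry: store what was collected and reset.
            records.append({
                "accessions": accessions,
                "description": " ".join(description),
                "sequence": "".join(sequence),
            })
            accessions, description, sequence = [], [], []
            in_sequence = False
        elif in_sequence:
            # Sequence lines hold blocks of bases plus a running position count.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif code == "AC":
            accessions.extend(a.strip() for a in data.split(";") if a.strip())
        elif code == "DE":
            description.append(data)
        elif code == "SQ":
            in_sequence = True
    return records

records = parse_embl(SAMPLE)
# Index the entries so that they can be looked up by accession number.
by_accession = {acc: rec for rec in records for acc in rec["accessions"]}
```

With the index in place, `by_accession["ZZ99999"]["sequence"]` returns the sequence of the sample entry.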

DDBJ (Japan)

  • The DNA Data Bank of Japan began as a collaboration with EMBL and GenBank. It is run by the National Institute of Genetics.
  • One can search for entries by accession number.
PROTEIN DATABASES
PRIMARY PROTEIN SEQUENCE DATABASES
PIR (Protein Information Resource Protein Sequence Database)
  • PIR-PSD is maintained by the National Biomedical Research Foundation (NBRF), the Japan International Protein Information Database (JIPID), and the Martinsried Institute for Protein Sequences (MIPS).
  • PIR-PSD data processing involves four major steps: import, merging, classification and annotation
  • PIR-PSD primarily covers naturally occurring wild-type sequences. Its sources are GenBank/EMBL/DDBJ translations, published literature, and direct submissions to PIR-International.
  • One can search for entries or run sequence similarity searches at the PIR site.
  • The database can also be downloaded as a set of flat files
  • PIR also produces NRL-3D, a database of sequences extracted from the three-dimensional structures available in the Protein Data Bank (PDB).
  • The database is split into four distinct sections, designated PIR1 to PIR4, which differ in terms of data quality and level of annotation:
  • PIR1: Fully classified and annotated entries
  • PIR2: Includes preliminary entries which have not been thoroughly reviewed and may contain redundancy
  • PIR3: Unverified entries
  • PIR4: Entries fall into one of four categories: (i) conceptual translations of artefactual sequences; (ii) conceptual translations of sequences that are not transcribed or translated; (iii) protein sequences or conceptual translations that are extensively genetically engineered; (iv) sequences that are not genetically encoded and not produced on ribosomes
  • Programs are provided for data retrieval and sequence searching via the NBRF-PIR database web page.

MIPS

  • The Martinsried Institute for Protein Sequences collects and processes sequence data for the tripartite PIR-International Protein Sequence Database project.
  • The database is distributed with PATCHX, a supplement of unverified protein sequences from external sources.

Access to the database is provided through a web server. Results of FASTA similarity searches of all proteins within PIR-International and PATCHX are stored in a dynamically maintained database, allowing instant access to FASTA results.

SWISS-PROT

  • Currently maintained by SIB and EBI/EMBL.
  • Provides high-level annotation, including descriptions of the function of the protein, the structure of its domains, its post-translational modifications, variants, and so on.
  • Details of SWISS-PROT entry

TrEMBL

  • Computer-annotated supplement to SWISS-PROT.
  • Contains translations of all coding sequences (CDS) in EMBL.
  • SP-TrEMBL: Contains entries that will eventually be incorporated into SWISS-PROT but have not yet been manually annotated.
  • REM-TrEMBL: Contains sequences that are not destined to be included in SWISS-PROT.
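
The core operation behind a translated CDS can be sketched in a few lines. The example below translates a coding sequence with the standard genetic code, stopping at the first stop codon. It is only an illustration: the function name `translate_cds` is made up here, and details a real pipeline has to handle, such as alternative genetic codes and partial coding sequences, are ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # Codons with ambiguous or unknown bases are written as 'X'.
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))  # start codon, one alanine codon, stop
```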

NRL-3D

  • Produced by PIR from sequences extracted from the Brookhaven Protein Data Bank (PDB).

SECONDARY DATABASES
PROSITE

  • PROSITE is a database of short protein sequence patterns and profiles that
    characterize biologically significant sites in proteins.
  • It is a part of SWISS-PROT and is maintained in the same way as SWISS-PROT.
  • PROSITE is based on regular expressions describing characteristic sub-sequences of specific protein families or domains.
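
Because PROSITE patterns are essentially regular expressions in a different notation, they can be converted mechanically. The sketch below handles the common pattern elements: `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for anchoring to the ends of the sequence. The function name `prosite_to_regex` and the test sequence are made up for this example, and rarer syntax (such as `>` inside square brackets) is not covered.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a Python regex."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repetition such as (3) or (2,4).
        repeat = ""
        if element.endswith(")"):
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # one residue or an [ABC] set
        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)

# The N-glycosylation site pattern, searched in an invented sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hits = [m.start() for m in re.finditer(regex, "MKNGSAANPTL")]
```

Here `regex` is `N[^P][ST][^P]`. The second `N` in the test sequence is followed by a proline, so it is not reported as a site.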

Profile

  • A profile is a position-specific scoring table that encapsulates the sequence
    information within a complete alignment.
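
To make the idea of a position-specific scoring table concrete, the sketch below builds one from a small ungapped alignment and uses it to score candidate segments. It is a simplified illustration rather than the method used by any particular database: it assumes a uniform background frequency for all 20 amino acids and a simple pseudocount, and the function names and the toy alignment are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# Toy alignment: positions 1 and 2 are fully conserved, position 3 varies.
profile = build_profile(["ACD", "ACE", "ACD"])
```

A segment that resembles the alignment, such as `"ACD"`, scores higher than an unrelated one, such as `"WWW"`.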

PRINTS

  • PRINTS provides a compendium of protein fingerprints: groups of conserved
    motifs that characterize a protein family.

Pfam

  • Pfam is a database of protein families defined as domains (contiguous segments of the entire protein sequence). For each domain, it contains a multiple alignment of a set of defining sequences (the seeds) and the other sequences in SWISS-PROT and TrEMBL that can be matched to that alignment.


BLOCKS

  • BLOCKS contains patterns without gaps (blocks) in aligned protein families defined by PROSITE, found by pattern searching and statistical sampling algorithms.
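
What "without gaps" means can be shown with a small example. The sketch below scans a multiple alignment and reports every run of columns in which no sequence has a gap. This is only an illustration of ungapped segments, not the pattern searching and sampling procedure used to build BLOCKS. The function name, the minimum width, and the toy alignment are all invented.

```python
def ungapped_blocks(alignment, min_width=3, gap="-"):
    """Return (start, end, segments) for each gap-free run of columns."""
    length = len(alignment[0])
    blocks = []
    start = None
    # Run one column past the end so that a trailing block is closed too.
    for col in range(length + 1):
        gapped = col == length or any(seq[col] == gap for seq in alignment)
        if not gapped and start is None:
            start = col
        elif gapped and start is not None:
            if col - start >= min_width:
                blocks.append((start, col, [seq[start:col] for seq in alignment]))
            start = None
    return blocks

# Toy alignment with gaps in columns 3 and 7 (counting from 0).
toy_alignment = ["MKV-LAGGT", "MRVALAG-T", "MKV-LSGGT"]
blocks = ungapped_blocks(toy_alignment)
```

Each reported block gives the start and end columns (end exclusive) and the corresponding segment of every sequence.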

Identify

  • IDENTIFY is another automatically derived tertiary resource, built from BLOCKS.
STRUCTURAL CLASSIFICATION DATABASES
SCOP (Structural Classification of Proteins)

  • SCOP classification levels: Family, Superfamily, Fold

CATH (Class, Architecture, Topology, Homology)

  • CATH classification levels: Class, Architecture, Topology, Homology
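
CATH labels each homologous superfamily with a dotted code whose four numbers correspond to the Class, Architecture, Topology, and Homology levels listed above. The sketch below parses such a code and reports the deepest level that two codes share. The type and function names are made up for this example, and the codes are used purely as strings, without looking anything up in CATH itself.

```python
from typing import NamedTuple, Optional

LEVELS = ("Class", "Architecture", "Topology", "Homology")

class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homology: int

def parse_cath(code: str) -> CathCode:
    """Parse a four-part dotted code such as '3.40.50.300'."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))

def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name of the deepest level at which two codes still agree, if any."""
    shared = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = name
    return shared

first = parse_cath("3.40.50.300")
second = parse_cath("3.40.50.720")
shared = deepest_shared_level(first, second)
```

The two codes above agree on their first three numbers and differ only in the last one, so `shared` is `"Topology"`.
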

PDBsum

  • Provides summaries and analyses of all the structures in the PDB.

Biological databases play a useful and informative role in the biology community, providing the first step toward rigorous and accurate bioinformatic analyses.
