Bioinformatics and Functional Genomics

Chapter 2: Access to sequence data and literature information


Web resources from Chapter 2:
Website URL
Pfam http://www.sanger.ac.uk/Software/Pfam
CERN http://public.web.cern.ch/Public
Protein Data Bank http://www.rcsb.org/pdb
Entrez tutorials http://www.ncbi.nlm.nih.gov/Education/index.html
Ensembl http://www.ensembl.org
The Wellcome Trust Sanger Institute http://www.sanger.ac.uk/
EMBL http://www.ebi.ac.uk/embl
Swiss Institute of Bioinformatics http://www.expasy.ch/
Lion Biosciences http://www.lionbioscience.com
Sequence Retrieval System (SRS) servers http://downloads.lionbio.co.uk/publicsrs.html
Protein Information Resource (PIR) http://pir.georgetown.edu
National Library of Medicine (NLM) http://www.nlm.nih.gov/
Medline via the SRS at the European Bioinformatics Institute http://srs.ebi.ac.uk/
PubMed tutorial http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html
Growth of Medline http://www.nlm.nih.gov/bsd/medline_growth.html
Medline languages http://www.nlm.nih.gov/bsd/medline_lang_distr.html
NLM Gateway http://gateway.nlm.nih.gov
Association of Research Libraries (ARL) http://www.arl.org/scomm/edir/index.html
MeSH web site at NLM http://www.nlm.nih.gov/mesh/meshhome.html

Tables

Table 2-1. Species represented in GenBank from 1995-2001. (From http://www.ncbi.nlm.nih.gov/Taxonomy/)
  year
Species 1995 1996 1997 1998 1999 2000 2001
All species 15925 22862 32569 43159 61525 87168 113940
Viruses 1740 1964 2513 2794 3401 4165 5436
Bacteria 2899 3798 6015 8625 14209 22616 28186
Archaea 154 227 376 544 1003 1697 2038
Eukaryota 10339 15863 22539 29844 41295 56800 76099

Table 2-2. The twenty most sequenced organisms in GenBank (release 130.0), June 2002. From ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
Entries Bases Species Common name
5768791 8,739,692,225 Homo sapiens human
3646641 3,630,891,163 Mus musculus mouse
432098 2,144,761,548 Rattus norvegicus rat
334021 679,777,809 Drosophila melanogaster fruit fly
348961 335,187,724 Arabidopsis thaliana thale cress (plant)
73378 284,340,107 Oryza sativa (japonica cultivar) rice
196431 220,065,371 Caenorhabditis elegans worm
299854 195,346,835 Brassica oleracea broccoli
137438 190,059,367 Oryza sativa rice
274512 171,222,170 Danio rerio zebrafish
189099 169,080,139 Tetraodon nigroviridis pufferfish
160454 158,805,609 Pan troglodytes chimpanzee
285297 132,985,011 Zea mays corn
238959 124,512,287 Bos taurus cow
266106 122,912,577 Glycine max soybean
209750 107,520,873 Xenopus laevis frog
166706 96,343,348 Medicago truncatula legume
155272 95,896,298 Anopheles gambiae mosquito
174493 91,200,209 Ciona intestinalis ascidian
155598 88,752,017 Dictyostelium discoideum slime mold

Table 2-3. Top ten organisms for which ESTs have been sequenced (dbEST release 81602, August 2002). Many thousands of cDNA libraries have been generated from a variety of organisms, and the total number of public entries is currently over 12 million (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html).
Organism Common name Number of ESTs
Homo sapiens human 4,550,451
Mus musculus + domesticus mouse 2,633,209
Rattus sp. rat 351,827
Glycine max soybean 268,299
Drosophila melanogaster fruit fly 256,583
Danio rerio zebrafish 255,407
Hordeum vulgare + subsp. barley 240,877
Bos taurus cattle 235,495
Xenopus laevis African clawed frog 220,504
Triticum aestivum wheat 198,800

Table 2-4. Organisms represented in UniGene (http://www.ncbi.nlm.nih.gov/UniGene/) November, 2002.
Organism Abbreviation Common name
Arabidopsis thaliana At thale cress
Bos taurus Bt cow
Danio rerio Dr zebrafish
Homo sapiens Hs human
Hordeum vulgare Hv barley
Mus musculus Mm mouse
Oryza sativa Os rice
Rattus norvegicus Rn rat
Triticum aestivum Ta wheat
Xenopus laevis Xl clawed frog
Zea mays Zm corn (maize)

Table 2-5. Organisms from which sequence-tagged sites (STSs) have been obtained (http://www.ncbi.nlm.nih.gov/genome/sts/unists_stats.html), November 2002.
Organism Number of STSs
Homo sapiens (human) 134,000
Rattus norvegicus (Norway rat) 30,000
Mus musculus (house mouse) 27,000
Danio rerio (zebrafish) 22,000
Drosophila melanogaster (fruit fly) 1,100

Table 2-6. Organisms from which genome survey sequences (GSSs) have been obtained (http://www.ncbi.nlm.nih.gov/dbGSS/dbGSS_summary.html), November 2002.
Organism Approximate number of GSSs
Mus musculus (house mouse) 944,000
Homo sapiens (human) 872,000
Brassica oleracea (vegetable) 469,000
Rattus norvegicus (Norway rat) 307,000
Arabidopsis thaliana (thale cress) 191,000
Tetraodon nigroviridis (pufferfish) 189,000

Box 2-1. Types of accession numbers. Modified from http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html.
Type of Record Sample Accession Format
GenBank/EMBL/DDBJ Nucleotide Sequence Records One letter followed by five digits, e.g., X02775; or two letters followed by six digits, e.g., AF025334
GenPept Sequence Records (which contain the amino acid translations from GenBank/EMBL/DDBJ records that have a coding region feature annotated on them) Three letters and five digits, e.g., AAA12345
Protein Sequence Records from SwissProt and PIR Usually one letter and five digits, e.g., P12345. SwissProt numbers may also be a mixture of numbers and letters.
Protein Sequence Records from PRF A series of digits (often six or seven) followed by a letter, e.g., 1901178A
RefSeq Nucleotide Sequence Records Two letters, an underscore, and six digits, e.g., mRNA records (NM_*): NM_006744; genomic DNA contigs (NT_*): NT_008769
RefSeq Protein Sequence Records Two letters (NP), an underscore, and six digits, e.g., NP_006735
Protein Structure Records PDB accessions generally contain one digit followed by three letters, e.g., 1TUP. They may contain other mixtures of numbers and letters (or numbers only). MMDB ID numbers generally contain four digits, e.g., 3973
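
The formats in Box 2-1 can be expressed as regular expressions. The following Python sketch is illustrative only: the function name classify_accession and the labels are assumptions made for this example, and the patterns cover only the usual formats listed in the box, not the exceptions it mentions (such as SwissProt numbers that mix letters and digits).

```python
import re

# Patterns transcribed from the usual formats described in Box 2-1.
# The labels are names chosen for this sketch, not official database terms.
ACCESSION_PATTERNS = [
    ("GenBank/EMBL/DDBJ nucleotide", r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}"),
    ("GenPept protein", r"[A-Z]{3}[0-9]{5}"),
    ("SwissProt/PIR protein", r"[A-Z][0-9]{5}"),
    ("PRF protein", r"[0-9]{6,7}[A-Z]"),
    ("RefSeq nucleotide", r"N[MT]_[0-9]{6}"),
    ("RefSeq protein", r"NP_[0-9]{6}"),
    ("PDB structure", r"[0-9][A-Z]{3}"),
    ("MMDB structure ID", r"[0-9]{4}"),
]


def classify_accession(accession):
    """Return every Box 2-1 record type whose usual format matches."""
    text = accession.strip().upper()
    return [
        label
        for label, pattern in ACCESSION_PATTERNS
        if re.fullmatch(pattern, text)
    ]


if __name__ == "__main__":
    for example in ["X02775", "AF025334", "AAA12345", "NM_006744", "1TUP"]:
        print(example, classify_accession(example))
```

The function returns a list rather than a single label because the formats overlap: one letter followed by five digits fits both a GenBank/EMBL/DDBJ nucleotide record and a SwissProt or PIR protein record, so the format alone cannot tell X02775 from P12345.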

Table 2-7. Formats of accession numbers for RefSeq entries.
Molecule Accession Format Genome
Complete Genome NC_###### Archaea, Bacteria, Organelle, Virus
Complete Chromosome NC_###### Eukaryote
Complete Sequence NC_###### Plasmid
Genomic Contig NT_###### Homo sapiens
mRNA NM_###### Limited Vertebrate: human, mouse, rat
Protein NP_###### All of the above
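
As a rough illustration of Table 2-7, the two-letter prefix of a RefSeq accession can be looked up to find the molecule type. This is a minimal Python sketch; the names REFSEQ_PREFIXES and refseq_molecule are hypothetical, and only the four prefixes listed in the table are covered.

```python
import re

# Prefix-to-molecule mapping taken from Table 2-7. NC_ is shared by three
# rows of the table, so the prefix alone cannot distinguish between them.
REFSEQ_PREFIXES = {
    "NC": "complete genome, complete chromosome, or complete sequence",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
}


def refseq_molecule(accession):
    """Return the Table 2-7 molecule type, or None if the format differs."""
    match = re.fullmatch(r"([A-Z]{2})_[0-9]{6}", accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for example in ["NM_006744", "NT_008769", "NP_006735"]:
        print(example, refseq_molecule(example))
```

An accession that does not follow the two letters, underscore, six digits layout, or that uses a prefix not listed in the table, yields None rather than a guess.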
 
Table 2-8. Databases containing nucleotide and protein information
GenBank
DNA Database of Japan (DDBJ)
EMBL/EBI (European Molecular Biology Lab/European Bioinformatics Institute)
Ensembl
Protein Information Resource
SRS at ExPASy
SRS at DDBJ

Table 2-9. Databases containing literature information
PubMed
Public Library of Science
