Overview of DRAGON
Microarray technology allows for the detection
of the expression of thousands of genes simultaneously. Once a microarray
experiment has been performed and data have been generated, it is often
difficult to know what the biological characteristics of all the genes
(and their encoded proteins) are in the experiment. Researching
individual genes on the World Wide Web is time-consuming and limited
in scope. A bioinformatics tool is therefore needed that can supply
the relevant biological information for all of the genes in a
microarray experiment simultaneously.
We have developed "Database Referencing of Array Genes ONline" or "DRAGON". DRAGON is freely available. DRAGON can rapidly supply information pertaining to a range of the biological characteristics of a majority of the genes in any microarray data set. The subsequent inclusion of this information during the analysis of microarray data allows for deeper insight into gene expression patterns. Once you have used the DRAGON Database to annotate your data, try out the information visualization tools provided by DRAGON View in order to further analyze your data. You can access and query the DRAGON database and the Dragon View tools via this web site. Below are instructions for how to use the different features of DRAGON.
Map of DRAGON Tools
The image below represents an overview or
flow diagram of how a user can use the tools available on the DRAGON web
site. Click on the images or text below in order to find out more about
how to use each tool.
| Section | Tool | Typical question |
| DRAGON Database | Annotate | "I would like to analyze my microarray data in relation to specific biological characteristics. How do I add that information to all of my microarray genes simultaneously?" |
| DRAGON Database | Search | "I would like to know whether keratin genes have ever been found to be expressed in brain tissue." |
| DRAGON Database | Compare | "I am interested in cytokines and would like to perform a microarray experiment. Which microarray platform has the cytokines I am interested in?" |
| DRAGON Database | Learn | "I don't yet understand what DRAGON is, why I want to use it or how it can help in my microarray data analysis." |
| DRAGON Database | Download | "I know how to use databasing systems and I want to use DRAGON on my own computer." |
| DRAGON Database | Order | "I know how to use databasing systems and I want to use DRAGON on my own computer but I have a really slow connection to the internet." |
| DRAGON View | View | "I used DRAGON to derive information about my genes but now I want to be able to view relationships between my expression data and the biological characteristics of my genes." |
| DRAGON Map | Explore | "I am interested in how human genes are distributed across tissues and chromosomes and how those genes are shared between different tissues." |
| Other | Contact | "You misspelled something." or "My query didn't work, why?" or "DRAGON is really helpful!" |
DRAGON Database
The DRAGON Database integrates data publicly
available in a number of other web-accessible databases. These other databases
include Unigene, SWISS-PROT, Pfam and the Kyoto Encyclopedia of Genes and
Genomes (KEGG). Perl scripts download and parse the information
present in the large flat-files provided by each of these databases (see
the home page for exactly where those original
flat-files are). Once the information is resident in the DRAGON Database, users can query
it in two ways: Annotate
and Search. Accession numbers (not similarity searches)
are used to integrate the information available in the DRAGON Database. For
example, when a user requests certain types of information related to the
GenBank numbers in their list, an SQL statement is dynamically generated
that contains the joins needed to string together the desired information
and send it back to the user. The Schema below provides
an example of how this is done.
Presently, you can Annotate
any kind of data that you would like: radioactivity-based
or two-color fluorescence-based microarray data, SAGE data,
differential display data or subtractive hybridization data. The type of
data does not matter. However, the DRAGON View tools are presently designed to work
with two-color microarray data. In other words, your expression data should
be in the form of ratio expression values where positive values
mean up-regulated in one sample (e.g. the "treated" or "disease"
sample) and negative values mean down-regulated in that sample versus the
"control" sample.
Schema
When a list of Genbank numbers is entered into
the Annotate page and additional information is requested (e.g. SWISS-PROT
Accession #'s or Pfam Names), accession numbers are used to link together
all of that information. In other words, similarity searching algorithms are
not run for each Genbank number. An example of the way that
accession numbers are linked together is provided in the schema diagram below.
In this schema Incyte data is joined together with tables containing information
from numerous other database sources. Depending upon what sorts of information
a user requests when they use the Annotate tool, these types of joins (as
SQL statements) are dynamically generated by the Perl CGI script and passed
to the DRAGON database which is running within the MySQL database management
system (DBMS). The information extracted from the DRAGON database is passed
back to the Perl
CGI script for appropriate output (i.e. html, text or email). So, if you would
like to download all of the flat-files that we
use in the DRAGON database, then you are also going to need to know how to
link together all of the information in those files. The schema below will
give you the general idea.
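As a rough illustration of how joins of this kind might be generated dynamically, here is a minimal sketch in Python (DRAGON itself uses Perl CGI scripts, which are not shown here). Every table name, column name and field name below is an assumption made up for the example; none of them is taken from the actual DRAGON schema.

```python
# Hypothetical whitelist: requested field -> (SQL column, JOIN clause).
# All table and column names are invented for illustration only.
ANNOTATION_SOURCES = {
    "cytoband": (
        "info.cytoband",
        "LEFT JOIN unigene_info AS info ON info.unigene_id = u1.unigene_id",
    ),
    "gene_name": (
        "info.gene_name",
        "LEFT JOIN unigene_info AS info ON info.unigene_id = u1.unigene_id",
    ),
    "pfam_name": (
        "pf.pfam_name",
        "LEFT JOIN pfam_names AS pf ON pf.unigene_id = u1.unigene_id",
    ),
}


def build_annotation_query(requested_fields, genbank_numbers):
    """Build a parameterized SELECT for the fields the user ticked."""
    unknown = [f for f in requested_fields if f not in ANNOTATION_SOURCES]
    if unknown:
        raise ValueError(f"unsupported fields: {unknown}")
    if not genbank_numbers:
        raise ValueError("at least one GenBank number is required")

    select_cols = ["u1.genbank"]
    joins = []
    for field in requested_fields:
        column, join = ANNOTATION_SOURCES[field]
        select_cols.append(f"{column} AS {field}")
        if join not in joins:  # join each source table only once
            joins.append(join)

    # One "?" placeholder per accession keeps user input out of the SQL text.
    placeholders = ", ".join("?" for _ in genbank_numbers)
    sql = (
        "SELECT " + ", ".join(select_cols)
        + " FROM unigene_numbers AS u1 "
        + " ".join(joins)
        + f" WHERE u1.genbank IN ({placeholders})"
    )
    return sql, list(genbank_numbers)


sql, params = build_annotation_query(["cytoband", "gene_name"],
                                     ["AL036211", "NM_000700"])
print(sql)
print(params)
```

Only the joins needed for the requested fields end up in the statement, which is the general idea behind generating the SQL per request rather than running one fixed query.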
NOTE: One important aspect of this schema
is that the "Unigene Numbers" table is used twice (bottom left-hand
corner of the schema) as "Unigene Numbers" and "Unigene Numbers_1".
These are two replicates of the exact same table. The join from the
"Incyte Numbers" table links the "Genbank #s" field
from that table with the
"Genbank #s" field from the "Unigene Numbers" table. Then
the "Unigene ID" field from the first replicate ("Unigene Numbers")
is joined with the same field from the second replicate ("Unigene
Numbers_1"). Finally, the "Genbank #s" field from the
"Unigene Numbers_1" table can be joined with the "Genbank #s" field from
other tables. The key here is that this expands any given Genbank # from the
user input file into a list of every single Genbank # associated with a given
Unigene ID. Each Unigene ID identifies a cluster of Genbank #s that are (presumably)
fragments of a single cDNA. The "Swissprot Numbers" table only provides
a few Genbank #s (at most) for any given Swissprot #. Therefore, starting
with a list of every single Genbank # associated with a given cDNA is crucial
to bridging the gap between Unigene and SWISS-PROT.
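The self-join described in this note can be sketched with a small in-memory SQLite database. This is only an illustration under stated assumptions: the table names, column names, accession numbers and cluster IDs are all invented, and DRAGON itself runs on MySQL rather than SQLite.

```python
import sqlite3

# Toy data: three GenBank numbers share one (made-up) Unigene cluster, and
# the (made-up) Swissprot entry lists only one of them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE unigene_numbers (unigene_id TEXT, genbank TEXT);
CREATE TABLE swissprot_numbers (swissprot_acc TEXT, genbank TEXT);
INSERT INTO unigene_numbers VALUES
  ('Hs.0001', 'AA000001'),
  ('Hs.0001', 'AA000002'),
  ('Hs.0001', 'AA000003'),
  ('Hs.0002', 'BB000001');
INSERT INTO swissprot_numbers VALUES ('P00001', 'AA000003');
""")


def swissprot_direct(genbank):
    """Look up Swissprot accessions using only the user's own GenBank number."""
    rows = conn.execute(
        "SELECT swissprot_acc FROM swissprot_numbers WHERE genbank = ?",
        (genbank,),
    ).fetchall()
    return [r[0] for r in rows]


def swissprot_via_unigene(genbank):
    """Expand the GenBank number to its whole Unigene cluster, then look up."""
    rows = conn.execute(
        """
        SELECT DISTINCT s.swissprot_acc
        FROM unigene_numbers AS u1                 -- first replicate
        JOIN unigene_numbers AS u2                 -- second replicate ("_1")
          ON u2.unigene_id = u1.unigene_id
        JOIN swissprot_numbers AS s
          ON s.genbank = u2.genbank
        WHERE u1.genbank = ?
        ORDER BY s.swissprot_acc
        """,
        (genbank,),
    ).fetchall()
    return [r[0] for r in rows]


print(swissprot_direct("AA000001"))
print(swissprot_via_unigene("AA000001"))
```

In this toy data the direct lookup finds nothing for AA000001, while the lookup through the two replicates of the Unigene table reaches the Swissprot entry via a sibling GenBank number in the same cluster.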
Currently, only parts of this schema are implemented
in the Annotate tool on the web site. For example, the Trembl, Transfac and
Interpro tables are not yet available, but we hope to add them
shortly (time permitting).
Anyone who has used Microsoft Access before might recognize the look
of the schema below. It was generated in MS Access. MS Access is certainly
not the fastest, most robust or most well-featured database management system
(DBMS) out there. However, if you like, you can download all of the data that
is used to create the DRAGON database and use MS Access to play with the data
and get a sense of how everything fits together.

Annotate
NOTE:
Annotation of large data sets (more than roughly 1000
rows) can take a while (e.g. 15 minutes). Therefore, if you plan on using
the Annotate tool for queries containing more than roughly 1000 GenBank numbers, make
sure to upload your data (do not paste it in) and have the results emailed to
you (do not have them output as a tab-delimited text file or an html page).
The figure of roughly 1000 is a guideline, not a fixed threshold.
If you plan on pasting data into the text area for annotation, MAKE SURE TO
CHECK THAT THE LAST ROW THAT SHOWS UP IN THE TEXTAREA IS THE PROPER LAST ROW
OF YOUR DATA SET! If you enter too much data into the text area, it can fill
up, and if so it simply chops off any remaining data without telling you that
it has done so.
1) Paste a delimited text file (tab-delimited
is the default) into the "Data Entry" field, or upload your file.
Click here to see an example of the type
of tab-delimited text file you might paste into the field below. Or you can
view the example data sets provided on the Annotate page to get an idea of
the general form your data should take.
2) This text file should contain at least two columns. One column should
be your expression data (e.g. differential ratio values or absolute intensity
values, it doesn't matter for Annotation, however, the tools in DRAGON View
are geared toward analysis of ratio data). The other column should be the
Genbank accession numbers that were provided
with your microarray data. Presently, DRAGON can only take GenBank numbers.
(NOTE: some microarray data sets are provided with proprietary accession numbers.
In these cases, the company which produced the microarray should also provide
on their web site a table which contains Genbank accession numbers that correspond
with each of their proprietary accession numbers. If this is the case then
you need to integrate the Genbank accession numbers with your data before
you can use DRAGON).
3) Long lists, and searches that request more types of information, will
take more time and may take so long that your internet browser (e.g. Netscape
or Internet Explorer) thinks that no data was returned and times out.
If this happens, try the email option. DRAGON will email you your results
as soon as they are ready.
4) You need to tell DRAGON which column contains the Genbank accession
numbers by typing the number for the column into the "Column number containing
GenBank numbers" text box. As you might expect, your farthest left column
would be number 1.
5) Next, define what sorts of information you would like DRAGON to
append to your information by checking specific checkboxes from the "Unigene
Info:", "Swissprot Info:" and "PFAM Info:" sections
of the table. You can choose information from more than one database. For
example, you could check all the check boxes and DRAGON would give you all
the information it has about each of your microarray genes.
6) Click the "Submit Gene List" button.
7) DRAGON will add the information that you chose as checked items
as new columns appended to the end of the table that you provided. You have
to choose the type of output file that you would like DRAGON to provide.
8) Tip: If you wish to annotate your data with types of information that
categorize multiple genes (i.e. Pfam numbers, Swissprot Keywords or KEGG numbers),
then it is best to only annotate your data with one of these types of information
at a time.
An example of a tab-delimited text file that you could paste into DRAGON.
| Genbank Accession Number | Ratio (Cy3/Cy5) | Cy3 Intensity | Cy5 Intensity | Gene Name |
| AL036211 | 2.8 | 2423 | 800 | lumican |
| NM_001797 | 2.4 | 867 | 323 | cadherin 11 (OB-cadherin, osteoblast) |
| NM_000700 | 2.3 | 795 | 320 | annexin A1 |
| AW157548 | 2.2 | 5193 | 2128 | insulin-like growth factor binding protein 5 |
This is an example of the format of a file you could enter into the text field
on the Annotate page. The easiest way to generate a tab-delimited text file
(if you don't already have one) is to paste your data into a spreadsheet program
(e.g. Microsoft Excel) and then "Save as..." a "Tab-delimited
text file", which should have .txt as its file extension (if you are using
a PC). The Genbank Accession Number (RED TEXT; the first column in the example)
is what DRAGON will use to append other sorts of information to your table.
You will need to define which column contains Genbank Accession numbers after
you have pasted your data into the text field. The type of expression data (GREEN
TEXT; the ratio and intensity columns in the example) that you have will vary depending upon the type of
microarray that you have used. For example, if you have used a radioactivity
based system, then you won't have Cy3 and Cy5 intensity data. The key to the
type of expression data that you enter into DRAGON is that it is sufficient
for a full analysis of your microarray experiment in relation to the types
of information you derive from DRAGON. Finally, you can add other sorts of
information to your table, such as the gene names (BLUE
TEXT; the last column in the example) you were provided in your microarray data set. (NOTE:
If nothing is being returned in your searches or you are getting strange errors,
try removing any extraneous information columns, such as names. It is possible
that particular characters and white spaces in your names could be altering
your search).
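To make the input format concrete, here is a minimal Python sketch that reads tab-delimited text like the example table and pulls out the GenBank column by its column number, counting from 1 at the far left as the Annotate page does. The function name and its parameters are hypothetical and are not part of DRAGON.

```python
import csv
import io

EXAMPLE = (
    "Genbank Accession Number\tRatio (Cy3/Cy5)\tCy3 Intensity\tCy5 Intensity\tGene Name\n"
    "AL036211\t2.8\t2423\t800\tlumican\n"
    "NM_001797\t2.4\t867\t323\tcadherin 11 (OB-cadherin, osteoblast)\n"
    "NM_000700\t2.3\t795\t320\tannexin A1\n"
    "AW157548\t2.2\t5193\t2128\tinsulin-like growth factor binding protein 5\n"
)


def read_genbank_column(text, column_number, delimiter="\t", has_header=True):
    """Return the accession numbers found in a 1-based column of delimited text."""
    if column_number < 1:
        raise ValueError("column numbers start at 1 (the far-left column)")
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if has_header:
        rows = rows[1:]
    index = column_number - 1
    accessions = []
    for line_no, row in enumerate(rows, start=1):
        if index >= len(row):
            raise ValueError(f"data row {line_no} has no column {column_number}")
        accessions.append(row[index].strip())
    return accessions


accessions = read_genbank_column(EXAMPLE, column_number=1)
print(accessions)
print("last row:", accessions[-1])  # worth checking against your real last row
```

Printing the last accession is a cheap way to confirm that a pasted data set was not cut short.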
After DRAGON Annotate has finished annotating your data with whatever sorts of information
you requested, it returns the same delimited
text file with columns added to the end of the file. So, for example, if you
requested that the table above be annotated with chromosomal location information
and Medline Reference Numbers, then this is what the text file that you would
get back from DRAGON would look like:
| Genbank Accession Number | Ratio (Cy3/Cy5) | Cy3 Intensity | Cy5 Intensity | Gene Name | Chromosomal Location | Medline Numbers |
| AL036211 | 2.8 | 2423 | 800 | lumican | 12q21.3-q22 | 96047334 |
| NM_001797 | 2.4 | 867 | 323 | cadherin 11 (OB-cadherin, osteoblast) | 16q22.1 | 91283540 |
| NM_000700 | 2.3 | 795 | 320 | annexin A1 | 9q12-q21.2 | 99115644 |
| AW157548 | 2.2 | 5193 | 2128 | insulin-like growth factor binding protein 5 | 2q33-q36 | 99043863 |
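The "append columns to the end of each row" behaviour can be sketched in a few lines of Python. The lookup dictionary below simply reuses the values from the example output above; the function name, the field names and the choice to leave missing values as empty strings are assumptions for illustration, not a description of DRAGON's internals.

```python
def append_annotations(rows, genbank_index, annotations, fields):
    """Return copies of the rows with one extra column per requested field."""
    annotated = []
    for row in rows:
        info = annotations.get(row[genbank_index], {})
        # A missing value becomes an empty cell so every row keeps the same width.
        annotated.append(list(row) + [info.get(field, "") for field in fields])
    return annotated


rows = [
    ["AL036211", "2.8", "lumican"],
    ["NM_000700", "2.3", "annexin A1"],
    ["ZZ999999", "1.1", "placeholder gene with no annotation"],  # made-up row
]
annotations = {
    "AL036211": {"chromosomal_location": "12q21.3-q22", "medline": "96047334"},
    "NM_000700": {"chromosomal_location": "9q12-q21.2", "medline": "99115644"},
}

for row in append_annotations(rows, 0, annotations,
                              ["chromosomal_location", "medline"]):
    print("\t".join(row))
```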
If you annotate your gene list with certain types of classifying information,
specifically SWISS-PROT keywords, Pfam numbers, Pfam names or KEGG numbers,
and there is more than one classifier associated with a given gene (e.g.
cadherin V has five different keywords associated with it), then multiple
rows are generated. Each row contains the same gene information
except for the Pfam number or SWISS-PROT keyword. The
output is formatted this way so that you can take the output text
file and sort it by the classifying information, which groups your
genes into families of related genes.
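The row duplication and the sort that follows it can be sketched as below. The accession numbers and keywords are placeholders invented for the example, and the function name is hypothetical.

```python
def expand_by_classifier(rows, classifiers):
    """Emit one output row per (gene, classifier) pair.

    rows: lists whose first element is the accession number.
    classifiers: accession number -> list of keywords (or Pfam/KEGG numbers).
    Genes without any classifier get a single row with an empty last column.
    """
    expanded = []
    for row in rows:
        for value in classifiers.get(row[0], [""]):
            expanded.append(list(row) + [value])
    return expanded


rows = [["X00001", "2.4"], ["X00002", "-3.1"], ["X00003", "1.2"]]
keywords = {
    "X00001": ["Cell adhesion", "Transmembrane"],
    "X00002": ["Transmembrane"],
}

expanded = expand_by_classifier(rows, keywords)
# Sorting on the appended column groups genes that share a keyword.
by_keyword = sorted(expanded, key=lambda r: r[-1])
for row in by_keyword:
    print("\t".join(row))
```

After the sort, X00001 appears once under each of its two keywords, next to any other gene that carries the same keyword.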
A sample data set you can use on the annotation page.
You can use this tab-delimited
text file which contains a list of Genbank numbers to experiment with
the different features of DRAGON. Click on the link for the file, select all
contents of the file, copy and paste into the Data Entry field on the annotation
page. Annotate the genes as you wish.
One thing that you will encounter if you annotate this
list with certain types of data is that some of the genes are repeated a number
of times. This happens when
a gene is associated with more than one value of a criterion
you have chosen. For example, one gene and its associated protein can have
numerous keywords. Therefore, DRAGON repeats the gene on several
lines and provides a different keyword on each line. This way, you can use
the keywords to sort your data in a spreadsheet program such as MS Excel.
If a gene had two keywords associated with it and you sorted your whole list
by keywords, that gene would be in two different places on your list, clustered
with all other genes also associated with each of those keywords. Furthermore,
you can take your output and plug it into the suite of DRAGON
View information visualization tools.
Search
1) First, choose the database
that you want to search by clicking the radio button
associated with one of the databases. At the moment you can't search more
than one database simultaneously.
2) Next, you can type any characteristic that you are interested in
into the text boxes at the right of the table.
3) Then, you can check off any characteristic that you want to have
sent back to you at the left of the table.
4) Then click the "Submit Query" button.
5) DRAGON will search for genes or proteins only when you have entered
a characteristic and have checked its corresponding box. DRAGON will also
provide any other information that you have checked but not provided with
a search term.
An example of a DRAGON search.
For example, you select "Unigene:"
by clicking the radio button
to the left of it. Then you type "keratin" into the name field
at the right of the table and you check the checkbox
to the left of "Find gene by name:". Even though you only
want to search for keratins, you would also like to know the chromosomal cytoband
location of each keratin you find. Therefore, you check the checkbox to
the left of "Find gene by cytoband:" but you don't enter anything
into the text box at the right of the table. DRAGON will only search for keratins,
but it will provide all information it has about the cytoband location of
each keratin it finds.
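The rule in this example (a checked box with a search term acts as a filter, while a checked box without a term only adds an output column) can be modelled with a short Python sketch. The matching rule used here, a case-insensitive substring test, and the decision to return nothing when no filter is given are assumptions; the records are placeholders, not real DRAGON data.

```python
def search(records, checked, terms):
    """Filter on checked fields that have a term; return all checked fields.

    records: list of dicts, one per gene.
    checked: field names whose checkbox is ticked.
    terms:   field name -> text typed into the matching text box.
    """
    filters = {f: t.lower() for f, t in terms.items() if t and f in checked}
    if not filters:
        return []  # assumption: nothing is searched without a term plus a checked box
    results = []
    for rec in records:
        if all(filters[f] in str(rec.get(f, "")).lower() for f in filters):
            results.append({f: rec.get(f, "") for f in checked})
    return results


records = [
    {"name": "example keratin A", "cytoband": "band-1"},
    {"name": "example actin B", "cytoband": "band-2"},
    {"name": "Example Keratin C", "cytoband": ""},
]

hits = search(records, checked=["name", "cytoband"],
              terms={"name": "keratin", "cytoband": ""})
for hit in hits:
    print(hit)
```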
DRAGON View
The DRAGON View tools are being developed as a companion to the DRAGON Database
Annotation feature. Once you have annotated your data, you might want to be
able to visualize whether families of related genes are all regulated in a
similar manner or whether genes in the same cellular pathway are all differentially
regulated. The DRAGON View tools include DRAGON Families, DRAGON Order and
DRAGON Paths. Each tool is implemented as a Perl CGI script that dynamically
generates a visual output based upon the data set a user has submitted for
analysis.
DRAGON Families
DRAGON Families integrates two pieces of information that you provide. The first
is the ratio expression data which you derive from fluorescent based Cy3/Cy5
microarray experiments. The second is the "type" information which
you derive from the DRAGON database Annotate tool. Type information can be
many things. Presently, Pfam numbers, Swissprot keywords and KEGG numbers
are the most useful type information which DRAGON provides. The Instructions
on the DRAGON Families page will guide you through the process of entering
data. View the example data sets if you have questions about what your data
should look like. NOTE: Unlike the Annotate tool, it
is best to use comma-delimited text files with the DRAGON View tools.
The biggest confusion with DRAGON Families comes
with the output. What exactly does it mean? Here is a diagram and brief explanation.

Each box in the diagram above represents one gene (as defined by the Unigene database).
If you click on any box, you will be hyperlinked to the Unigene cluster which
corresponds to that gene. The color of each box represents the expression
of each of your genes.
Dark Red = 0 < X < 2.0
Bright Red = X >= 2.0
Dark Green = 0 > X > -2.0
Bright Green = X <= -2.0
The black text after each row of boxes indicates the type of all of the genes in that group. It is important to realize the fact that all of the boxes on any given row are ALL OF THE GENES IN YOUR DATA THAT HAVE THAT TYPE. Therefore, large numbers of genes which are all in the same group and are all up or down regulated are potentially interesting. (We are currently implementing statistical tests for the significance of this type of result that will be available as part of your output shortly). If you have indicated the type information you are using correctly while inputting your data, then if you click on the hyperlink you should be linked to the proper description of that type. Finally, the blue number in parentheses is the average ratio expression value for all of the genes in that group.
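The color legend and the per-group average can be written out as a small Python sketch. The thresholds come from the legend above; the function names and the sample records are hypothetical, and because the legend does not say how a value of exactly 0 is drawn, the sketch returns None in that case.

```python
from collections import defaultdict


def box_color(x):
    """Map a ratio expression value to the box color given in the legend."""
    if x >= 2.0:
        return "bright red"
    if x > 0:
        return "dark red"
    if x <= -2.0:
        return "bright green"
    if x < 0:
        return "dark green"
    return None  # exactly 0 is not covered by the legend


def summarize_families(records):
    """Group (gene, ratio, type) records by type; list colors and the mean ratio."""
    groups = defaultdict(list)
    for gene, ratio, gene_type in records:
        groups[gene_type].append(ratio)
    return {
        gene_type: {
            "colors": [box_color(r) for r in ratios],
            "average": sum(ratios) / len(ratios),
        }
        for gene_type, ratios in groups.items()
    }


records = [
    ("gene1", 2.0, "Cell adhesion"),
    ("gene2", 4.0, "Cell adhesion"),
    ("gene3", -0.5, "Transmembrane"),
    ("gene4", -2.5, "Transmembrane"),
]
for gene_type, info in summarize_families(records).items():
    print(gene_type, info["colors"], info["average"])
```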
DRAGON Order
DRAGON Order is another information visualization tool which attempts to get at the
same question addressed by DRAGON Families but from a different angle. The
main difference between the two tools is that DRAGON Order automatically pre-sorts
data by ratio expression values. Here is a diagram and description of the
output.
[Figure: example DRAGON Order output, with "+" and "-" signs printed at the top of the picture]
Each row of yellow lines in the picture above represents the entire list of genes which you entered. Each row is defined by a type written in white letters to the right. Each yellow line in a row marks the position of a gene which belongs to the type that defines that row. So, for example, the first row in the picture above is "Transmembrane." The "transmembrane" keyword is a rather broad category, so there are a large number of yellow lines in this row. Wherever there is a yellow line in the row, a gene which encodes a protein with a transmembrane domain is present. The key is that, because your list of genes is sorted by ratio expression value, the position of each yellow line is indicative of the expression level of that gene (the + (up-regulated) and - (down-regulated) signs at the top of the picture indicate the expression levels across the data). Therefore, an even distribution of yellow lines across the whole row means that there is no significant co-expression of a set of genes in that group. However, a cluster of yellow lines at either the far left or the far right of any given row is interesting because it means that a set of related genes are all up or down regulated. For example, four of the five "Cell Adhesion" genes are clustered to the left of the row.
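A plain-text version of this kind of display can be sketched in Python: sort the genes by ratio, then print one row per type with a mark at the position of every gene that has that type. Placing the most up-regulated gene at the left is an assumption, as are the function name and the sample genes.

```python
def order_rows(genes):
    """Return {type: row string}; '|' marks a gene of that type, '.' any other gene.

    genes: list of (name, ratio, set_of_types).
    Assumption: the most up-regulated gene is drawn at the far left.
    """
    ranked = sorted(genes, key=lambda g: g[1], reverse=True)
    all_types = sorted({t for _, _, types in ranked for t in types})
    return {
        t: "".join("|" if t in types else "." for _, _, types in ranked)
        for t in all_types
    }


genes = [
    ("g3", 0.4, {"Transmembrane"}),
    ("g1", 3.1, {"Cell adhesion", "Transmembrane"}),
    ("g5", -2.8, {"Transmembrane"}),
    ("g2", 2.5, {"Cell adhesion"}),
    ("g4", -1.2, set()),
]

rows = order_rows(genes)
for gene_type, marks in rows.items():
    print(f"{marks}  {gene_type}")
```

In this toy input both "Cell adhesion" marks fall at the left end of the row, which is the kind of clustering the paragraph above describes as interesting, while the "Transmembrane" marks are spread across the row.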
DRAGON Paths
The concept of DRAGON Paths is relatively straightforward. DRAGON Paths uses
diagrams of cellular pathways downloaded directly from the Kyoto Encyclopedia
of Genes and Genomes (KEGG) database in order to map the user's gene expression
values onto these cellular pathway diagrams. The idea is that by viewing the
expression levels derived from microarray data within the context of cellular
pathways, the user might be able to detect patterns of expression that might
not otherwise be apparent. Conveniently, KEGG provides a coordinate file for
every cellular pathway diagram in its database. The coordinate files
(you can view these files by following the link on our home
page to the *_gene_coord files at the KEGG ftp site) provide x,y coordinates
on each diagram for every protein product in the diagram, along with the Locuslink
number of the gene corresponding to each protein product. DRAGON Paths
simply uses the Locuslink numbers provided by the user to match the expression
value for each of the user's Locuslink numbers with the corresponding coordinate
location of that Locuslink number on any of the chosen KEGG cellular pathway
diagrams. Here is an example of a DRAGON Paths output.
Each green box in the diagram is representative of a human protein (see the KEGG web site for more information on the configuration of the KEGG pathway diagrams and why some boxes are green and some boxes are white). The numbers in the boxes are the EC number for the proteins. Red or green circles are placed in the upper left corner of each protein that is found in your data. The color of the circle is indicative of expression level. Red means up-regulated, green means down regulated. Each green box is hyperlinked to the Locuslink entry for that protein.
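The matching step can be sketched in Python as a lookup from Locuslink number to diagram coordinates. The record layout used here, a list of (Locuslink number, x, y) tuples, is an assumption and is not the actual format of the KEGG coordinate files; the identifiers and coordinates are made up.

```python
def place_circles(coordinate_records, expression):
    """Return (locuslink, x, y, color) for every box that has expression data.

    coordinate_records: list of (locuslink, x, y) tuples for one diagram.
    expression: locuslink -> ratio expression value supplied by the user.
    """
    circles = []
    for locuslink, x, y in coordinate_records:
        value = expression.get(locuslink)
        if value is None or value == 0:
            continue  # no data, or no direction of change to draw
        color = "red" if value > 0 else "green"  # red = up, green = down
        circles.append((locuslink, x, y, color))
    return circles


coordinate_records = [
    ("1001", 120, 80),   # made-up Locuslink numbers and pixel positions
    ("1002", 240, 80),
    ("1001", 360, 200),  # the same gene may appear in more than one box
    ("1003", 480, 200),
]
expression = {"1001": 2.6, "1003": -1.8}

circles = place_circles(coordinate_records, expression)
for circle in circles:
    print(circle)
```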
General Remarks
1) It is usually
best to only request one type of information from DRAGON at a time. For example,
if you want to know about KEGG #'s, Pfam #'s and SWISS-PROT Keywords for your
gene list, generate three different queries each requesting one of the three
different types of information.
Annotate Page
1) If one of the characteristics you choose to associate with a list
of Genbank numbers doesn't exist for some of the Genbank numbers on that list,
you may find that all of the rest of the data for that gene is also missing.
This is a programming error that we are currently addressing.
2) You may find that one of your Genbank numbers is being improperly
associated with other sorts of information such as the wrong Swissprot and
Pfam numbers. This is usually due to the existence of certain Genbank numbers
that refer to very large pieces of DNA such as BAC and YAC sequences. BAC
sequences, for example, can contain many genes and therefore that Genbank
number can be clustered with more than one Unigene Cluster ID. Since DRAGON
uses Unigene Cluster ID's, your Genbank number can be associated with incorrect
information because of the existence of these large Genbank numbers. We are
currently correcting this problem in the online version of DRAGON.
3) You may get strange errors in the output you receive from
DRAGON. For example, if you are getting nothing back but know that there should
be some output, or if the names of the genes are being put in the "Genbank
Number" column and nothing else is output, try inputting only a column
of Genbank numbers and a column of expression values. White spaces and other
characters in columns containing names, for example, can confuse the DRAGON
search engine. In fact, if you are getting any sort of strange error, try
inputting just two columns, Genbank numbers and expression data. This is a
programming error that we are currently addressing.
4) If you ask for just the Unigene Cluster ID for a list of Genbank
numbers, you may get nothing in return. Try asking for some other data as
well, such as the Locuslink number and you should now get back both the Unigene
number and the Locuslink number. This is a programming error that we are currently
addressing.
5) Any data set that contains
over ~1000 GenBank numbers is going to take a while, and the email option should
be used as the output option with these types of data sets.
Search Page
1) Your searches are taking a really long time. Sometimes, they take
so long that your browser times out and says that there was no data returned
from the server. This is an indexing problem in the database that is currently
being addressed.
2) You can't search for characteristics across databases, such as things
on chromosome 10 (Unigene) that have a EF-hand domain (Pfam). At present,
you can only search within individual databases.
3) You got a blank screen as a result. You may not have checked the
proper radio button for the database you would like to search. For example,
if you want to search for something in Swissprot and you enter information
into one of the fields (e.g. the Keyword field), you then have to also click
the radio button to the left of the name "Swissprot" at the top
left of the Swissprot section of the table.
Frequently Asked Questions
Question: Can I use other sorts of accession numbers, such as Tigr #'s or SWISS-PROT #'s, as my entry accession # instead of GenBank #'s on the Annotate page?
Answer: No, currently only GenBank numbers can be used as entry accession numbers on the Annotate page.
Question: Where do you get the information that is present in DRAGON?
Answer: The information present in DRAGON is extracted from a specific set of flat-files that are provided by the public databases integrated in DRAGON. You can view each of these flat-files by going to the DRAGON home page and looking in the bottom right hand section. A list of every single flat-file used in DRAGON and a link directly to it is provided.
Question: How often do you update DRAGON?
Answer: The DRAGON database is updated weekly (Sunday morning).
Question: Can I have a copy of the Perl scripts used to implement DRAGON?
Answer: Not presently. However, all of the data used in every single DRAGON tool, whether it is Annotate or Search etc., is available for download via this web site.
Copyright 2001 Kennedy Krieger Institute