Table of Contents

 Map of DRAGON Tools
 Examples of Data Input
 DRAGON Database
 Frequently Asked Questions
 References for DRAGON View

Overview of DRAGON
     Microarray technology allows for the detection of the expression of thousands of genes simultaneously. Once a microarray experiment has been performed and data have been generated, it is often difficult to know what the biological characteristics of all the genes (and their encoded proteins) are in the experiment. The task of researching individual genes on the world wide web can be time-consuming and limited in scope. Therefore, the development of a bioinformatics tool that is capable of defining all relevant biological information for all of the genes in a microarray experiment simultaneously is needed.

     We have developed "Database Referencing of Array Genes ONline", or "DRAGON". DRAGON is freely available and can rapidly supply information pertaining to a range of the biological characteristics of a majority of the genes in any microarray data set. The subsequent inclusion of this information during the analysis of microarray data allows for deeper insight into gene expression patterns. Once you have used the DRAGON Database to annotate your data, try out the information visualization tools provided by DRAGON View in order to further analyze your data. You can access and query the DRAGON Database and the DRAGON View tools via this web site. Below are instructions for how to use the different features of DRAGON.

Map of DRAGON Tools
     The image below represents an overview or flow diagram of how a user can use the tools available on the DRAGON web site. Click on the images or text below in order to find out more about how to use each tool.

Examples of Data Input

Dragon Database
"I would like to analyze my microarray data in relation to specific biological characteristics. How do I add that information to all of my microarray genes simultaneously?"
"I would like to know whether keratin genes have ever been found to be expressed in brain tissue."
"I am interested in cytokines and would like to perform a microarray experiment. Which microarray platform has the cytokines I am interested in?"
"I don't yet understand what DRAGON is, why I want to use it or how it can help in my microarray data analysis."
"I know how to use databasing systems and I want to use DRAGON on my own computer."
"I know how to use databasing systems and I want to use DRAGON on my own computer but I have a really slow connection to the internet."
Dragon View
"I used Dragon to derive information about my genes but now I want to be able to view relationships between my expression data and the biological characteristics of my genes."
Dragon Map
"I am interested in how human genes are distributed across tissues and chromosomes and how those genes are shared between different tissues."
"You misspelled something." or "My query didn't work, why?" or "Dragon is really helpful!"

DRAGON Database
     The DRAGON Database integrates data that are publicly available in a number of other web-accessible databases. These other databases include Unigene, SWISS-PROT, Pfam and the Kyoto Encyclopedia of Genes and Genomes (KEGG). Perl scripts are used to download and parse the information present in large flat-files provided by each of the databases (see the home page for the exact locations of the original flat-files). Once the information is resident in the DRAGON Database, users can query it in two ways: Annotate and Search. Accession numbers (not similarity searches) are used to integrate the information available in the DRAGON Database. For example, when a user requests certain types of information related to the GenBank numbers in their list, an SQL statement is dynamically generated that contains the joins needed to string together the desired information and send it back to the user. The schema below provides an example of how this is done.
     Presently, you can Annotate any kind of data that you would like: radioactivity-based or two-color fluorescence-based microarray data, SAGE data, differential display data or subtractive hybridization data. However, the DRAGON View tools are presently designed to work with two-color microarray data. In other words, your expression data should be in the form of ratio expression values where positive values arbitrarily mean up-regulated in one sample (e.g. the "treated" or "disease" sample) and negative values mean down-regulated in that sample versus the "control" sample.

     When a list of Genbank numbers is entered into the Annotate page and additional information is requested (e.g. SWISS-PROT Accession #'s or Pfam Names), accession numbers are used to link together all of that information. In other words, similarity searching algorithms are not run for each Genbank number. An example of the way that accession numbers are linked together is provided in the schema diagram below. In this schema, Incyte data is joined together with tables containing information from numerous other database sources. Depending upon what sorts of information a user requests when they use the Annotate tool, these types of joins (as SQL statements) are dynamically generated by the Perl CGI script and passed to the DRAGON database, which is running within the MySQL database management system (DBMS). The information extracted from the DRAGON database is passed back to the Perl CGI script for appropriate output (i.e. html, text or email). So, if you would like to download all of the flat-files that we use in the DRAGON database, then you will also need to know how to link together all of the information in those files. The schema below will give you the general idea.
     NOTE: One important aspect of this schema is that the "Unigene Numbers" table is used twice (bottom left-hand corner of the schema) as "Unigene Numbers" and "Unigene Numbers_1". These are actually two replicates of the exact same table. The join from the "Incyte Numbers" table integrates the "Genbank #s" field from that table with the "Genbank #s" field from the "Unigene Numbers" table. Then the "Unigene ID" field from the first replicate ("Unigene Numbers") is joined with the same field from the second replicate ("Unigene Numbers_1"). Finally, the "Genbank #s" field from "Unigene Numbers_1" can be joined with the "Genbank #s" field from other tables. The key here is that this expands any given Genbank # from the user input file into a list of every single Genbank # associated with a given Unigene ID. Each Unigene ID identifies a cluster of Genbank #s that are (presumably) fragments of a single cDNA. The "Swissprot Numbers" table only provides a few Genbank #s (at most) for any given Swissprot #. Therefore, starting with a list of every single Genbank # associated with a given cDNA is crucial to bridging the gap between Unigene and SWISS-PROT.
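The cluster expansion described in the note above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table names, column names and accession values are invented stand-ins for the tables in the schema, and DRAGON itself uses Perl and MySQL rather than Python and SQLite.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database with made-up accession numbers."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE unigene_numbers (unigene_id TEXT, genbank TEXT);
        CREATE TABLE swissprot_numbers (swissprot TEXT, genbank TEXT);
        INSERT INTO unigene_numbers VALUES
            ('Hs.1', 'GB_A'), ('Hs.1', 'GB_B'), ('Hs.1', 'GB_C'),
            ('Hs.2', 'GB_D');
        -- Assumed: SWISS-PROT lists only one GenBank number for this protein.
        INSERT INTO swissprot_numbers VALUES ('SP_1', 'GB_C');
    """)
    return con

def swissprot_for(con, genbank):
    """Expand one GenBank number to its whole Unigene cluster, then join."""
    sql = """
        SELECT DISTINCT s.swissprot
        FROM unigene_numbers AS u1
        JOIN unigene_numbers AS u2 ON u1.unigene_id = u2.unigene_id
        JOIN swissprot_numbers AS s ON s.genbank = u2.genbank
        WHERE u1.genbank = ?
        ORDER BY s.swissprot
    """
    return [row[0] for row in con.execute(sql, (genbank,))]

con = build_demo_db()
print(swissprot_for(con, "GB_A"))  # found through the shared cluster Hs.1
print(swissprot_for(con, "GB_D"))  # no SWISS-PROT entry in this toy data
```

Here the aliases u1 and u2 play the roles of "Unigene Numbers" and "Unigene Numbers_1". Without the self-join, GB_A would not match anything, because only GB_C appears in the hypothetical SWISS-PROT table.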
     Currently, only parts of this schema are implemented in the Annotate tool on the web site. For example, the Trembl, Transfac and Interpro tables are not yet available, but we hope to add them shortly, time permitting.
If you have used Microsoft Access before, you might recognize the look of the schema below. It was generated in MS Access. MS Access is certainly not the fastest, most robust or most fully featured database management system (DBMS) out there. However, if you like, you can download all of the data that is used to create the DRAGON database and use MS Access to play with the data and get a sense of how everything fits together.


NOTE: It is important to realize that the Annotation of large data sets (e.g. more than ~1000 rows) can take a while (e.g. 15 minutes). Therefore, if you plan on using the Annotate tool for queries containing more than ~1000 GenBank numbers, make sure to upload your data (do not paste it in) and have the results emailed to you (do not have them output as a tab-delimited text file or an html page). The ~1000 figure is only a rough guide and is not always the threshold you will want to use. If you plan on pasting data into the text area for annotation, MAKE SURE TO CHECK THAT THE LAST ROW THAT SHOWS UP IN THE TEXTAREA IS THE PROPER LAST ROW OF YOUR DATA SET! If you enter too much data into the text area, it can fill up, and if so it simply chops off any remaining data without telling you that it has done so.

1) Paste a delimited text file (tab-delimited is the default) into the "Data Entry" field, or upload your file. See the example below for the type of tab-delimited text file you might paste into the field. You can also view the example data sets provided on the Annotate page to get an idea of the general form your data should take.
2) This text file should contain at least two columns. One column should be your expression data (e.g. differential ratio values or absolute intensity values, it doesn't matter for Annotation, however, the tools in DRAGON View are geared toward analysis of ratio data). The other column should be the Genbank accession numbers that were provided with your microarray data. Presently, DRAGON can only take GenBank numbers. (NOTE: some microarray data sets are provided with proprietary accession numbers. In these cases, the company which produced the microarray should also provide on their web site a table which contains Genbank accession numbers that correspond with each of their proprietary accession numbers. If this is the case then you need to integrate the Genbank accession numbers with your data before you can use DRAGON).
3) Long lists, and searches that request more types of information, will take more time and may take so long that your internet browser (e.g. Netscape or Internet Explorer) assumes that no data was returned and times out. If this happens, try the email option. DRAGON will email your results to you as soon as they are ready.
4) You need to tell DRAGON which column contains the Genbank accession numbers by typing the number for the column into the "Column number containing GenBank numbers" text box. As you might expect, your farthest left column would be number 1.
5) Next, define what sorts of information you would like DRAGON to append to your information by checking specific checkboxes from the "Unigene Info:", "Swissprot Info:" and "PFAM Info:" sections of the table. You can choose information from more than one database. For example, you could check all the check boxes and DRAGON would give you all the information it has about each of your microarray genes.
6) Click the "Submit Gene List" button.
7) DRAGON will add the information that you choose as checked items as new columns appended to the end of the table that you provided. You have to choose the type of output file that you would like DRAGON to provide.
8) Tip: If you wish to annotate your data with types of information that categorize multiple genes (i.e. Pfam numbers, Swissprot Keywords or KEGG numbers), then it is best to only annotate your data with one of these types of information at a time.
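The input requirements in the steps above (delimited text, at least two columns, and a column number that starts at 1 for the leftmost column) can be checked before submitting. The function below is a hypothetical helper written for illustration; it is not part of DRAGON.

```python
import csv
import io

def read_genbank_column(text, column_number, delimiter="\t"):
    """Return the GenBank numbers found in a 1-based column of delimited text."""
    if column_number < 1:
        raise ValueError("column numbers start at 1 (the leftmost column)")
    accessions = []
    # QUOTE_NONE keeps quote characters in gene names from being interpreted.
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError("each row needs at least two columns")
        if column_number > len(row):
            raise ValueError("row has no column %d" % column_number)
        accessions.append(row[column_number - 1].strip())
    return accessions

# Two-column sample: GenBank number, then ratio expression value.
sample = "AL036211\t2.8\nNM_001797\t2.4\nNM_000700\t2.3\n"
print(read_genbank_column(sample, 1))
```

Running a check like this locally also makes it easy to confirm that the last row of a large data set is still present before pasting it into the text area.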

An example of a tab-delimited text file that you could paste into DRAGON.

Genbank Accession Number    Ratio (Cy3/Cy5)    Cy3 Intensity    Cy5 Intensity    Gene Name
AL036211                    2.8                2423             800              lumican
NM_001797                   2.4                867              323              cadherin 11 (OB-cadherin, osteoblast)
NM_000700                   2.3                795              320              annexin A1
AW157548                    2.2                5193             2128             insulin-like growth factor binding protein 5

     This is an example of the format of a file you could enter into the text field on the Annotate page. The easiest way to generate a tab-delimited text file (if you don't already have one) is to paste your data into a spreadsheet program (e.g. Microsoft Excel) and then "Save as..." a "Tab-delimited text file", which should have .txt as its file extension (if you are using a PC). The Genbank Accession Number (RED TEXT) is what DRAGON will use to append other sorts of information to your table. You will need to define which column contains Genbank Accession numbers after you have pasted your data into the text field. The type of expression data (GREEN TEXT) that you have will obviously vary depending upon the type of microarray that you have used. For example, if you have used a radioactivity-based system, then you won't have Cy3 and Cy5 intensity data. The key point is that the expression data you enter into DRAGON should be sufficient for a full analysis of your microarray experiment in relation to the types of information you derive from DRAGON. Finally, you can add other sorts of information to your table, such as the gene names (BLUE TEXT) you were provided in your microarray data set. (NOTE: If nothing is being returned in your searches or you are getting strange errors, try removing any extraneous information columns, such as names. It is possible that particular characters and white spaces in your names could be altering your search.)
     After DRAGON Annotate has finished annotating your data with the information you requested, it returns your delimited text file with the new columns added to the end. So, for example, if you requested that the table above be annotated with chromosomal location information and Medline Reference Numbers, then this is what the text file that you would get back from DRAGON would look like:

Genbank Accession Number    Ratio (Cy3/Cy5)    Cy3 Intensity    Cy5 Intensity    Gene Name                                       Chromosomal Location    Medline Numbers
AL036211                    2.8                2423             800              lumican                                         12q21.3-q22             96047334
NM_001797                   2.4                867              323              cadherin 11 (OB-cadherin, osteoblast)           16q22.1                 91283540
NM_000700                   2.3                795              320              annexin A1                                      9q12-q21.2              99115644
AW157548                    2.2                5193             2128             insulin-like growth factor binding protein 5    2q33-q36                99043863

     If you annotate your gene list with certain types of classifying information (specifically SWISS-PROT keywords, Pfam numbers, Pfam names or KEGG numbers) and there is more than one classifier associated with a given gene (e.g. cadherin V has five different keywords associated with it), then multiple rows are generated. Each row contains the same gene information and differs only in the classifier (e.g. the Pfam number or SWISS-PROT keyword). The output is formatted this way so that you can take the output text file and sort it by the classifying information, which groups your data into families of related genes.
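Grouping the one-row-per-classifier output into families can be sketched as follows. The rows and keywords are invented for illustration, and the row layout (GenBank number, ratio, keyword) is an assumption rather than DRAGON's exact output format.

```python
from collections import defaultdict

def group_by_classifier(rows, classifier_index):
    """Group annotated rows into families keyed by the classifier column."""
    families = defaultdict(list)
    for row in rows:
        families[row[classifier_index]].append(row)
    # Sort the family names so the result reads like a sorted spreadsheet.
    return dict(sorted(families.items()))

# Invented rows: (GenBank number, ratio, keyword); gene GB_A appears twice
# because it carries two keywords.
annotated = [
    ("GB_A", 2.8, "Cell adhesion"),
    ("GB_A", 2.8, "Glycoprotein"),
    ("GB_B", 2.4, "Cell adhesion"),
]
families = group_by_classifier(annotated, 2)
for keyword, members in families.items():
    print(keyword, [gene for gene, _, _ in members])
```

A gene with two keywords lands in two families, which is the same effect as sorting the repeated rows by keyword in a spreadsheet.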

A sample data set you can use on the annotation page.
     You can use this tab-delimited text file, which contains a list of Genbank numbers, to experiment with the different features of DRAGON. Click on the link for the file, select all contents of the file, then copy and paste them into the Data Entry field on the annotation page. Annotate the genes as you wish. One thing that you will encounter if you annotate this list with certain types of data is that some of the genes are repeated a number of times. This happens when a gene is associated with more than one value of a criterion you have chosen. For example, one gene and its associated protein can have numerous keywords. Therefore, DRAGON repeats the gene on numerous lines and provides a different keyword on each line. This way, you can use the keywords to sort your data in a spreadsheet program such as MS Excel. If a gene had two keywords associated with it and you sorted your whole list by keywords, that gene would appear in two different places on your list, each time clustered with all other genes associated with that keyword. Furthermore, you can take your output and plug it into the suite of DRAGON View information visualization tools.

1) First, choose the database that you want to search by clicking the radio button associated with one of the databases. At the moment you can't search more than one database simultaneously.
2) Next, you can type any characteristic that you are interested in into the text boxes at the right of the table.
3) Then, you can check off any characteristic that you want to have sent back to you at the left of the table.
4) Then click the "Submit Query" button.
5) DRAGON will search for genes or proteins only when you have entered a characteristic and have checked its corresponding box. DRAGON will also provide any other information that you have checked but not provided with a search term.

An example of a DRAGON search.
     For example, suppose you select "Unigene:" by clicking the radio button to the left of it. Then you type "keratin" into the name field at the right of the table and you check the checkbox to the left of "Find gene by name:". Although you only want to search for keratins, you would also like to know the chromosomal cytoband location of each keratin you find. Therefore, you check the checkbox to the left of "Find gene by cytoband:" but you don't enter anything into the text box at the right of the table. DRAGON will only search for keratins, but it will provide all the information it has about the cytoband location of each keratin it finds.
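The search rule described above (a checked field with a search term filters the results, while a checked field without a term is only returned) can be sketched as a small query builder. The table name, column names and the use of substring matching with LIKE are all assumptions made for illustration; the source does not describe how DRAGON's Perl CGI script builds its queries.

```python
def build_search(table, fields):
    """Build a parameterized SELECT from checked fields and search terms.

    `fields` maps a column name to (checked, term). Checked columns are
    returned; checked columns with a non-empty term also filter the search.
    Column and table names must come from a fixed, trusted list, because
    they are interpolated into the SQL text.
    """
    selected = [name for name, (checked, _) in fields.items() if checked]
    filters = [(name, term) for name, (checked, term) in fields.items()
               if checked and term]
    if not selected:
        raise ValueError("check at least one characteristic")
    sql = "SELECT %s FROM %s" % (", ".join(selected), table)
    if filters:
        sql += " WHERE " + " AND ".join("%s LIKE ?" % name for name, _ in filters)
    # Assumed matching rule: the term may appear anywhere in the field.
    params = ["%" + term + "%" for _, term in filters]
    return sql, params

# The keratin example: search by name, also return the cytoband.
sql, params = build_search("unigene", {
    "name": (True, "keratin"),
    "cytoband": (True, ""),
    "locuslink": (False, ""),
})
print(sql)
print(params)
```

Passing the search term as a bound parameter rather than pasting it into the SQL text keeps stray characters in the term from changing the query.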

The DRAGON View tools are being developed as a companion to the DRAGON Database Annotation feature. Once you have annotated your data, you might want to be able to visualize whether families of related genes are all regulated in a similar manner or whether genes in the same cellular pathway are all differentially regulated. The DRAGON View tools include DRAGON Families, DRAGON Order and DRAGON Paths. Each tool is implemented as a Perl CGI script that dynamically generates a visual output based upon the data set a user has submitted for analysis.

DRAGON Families
     DRAGON Families integrates two pieces of information that you provide. The first is the ratio expression data which you derive from fluorescence-based Cy3/Cy5 microarray experiments. The second is the "type" information which you derive from the DRAGON database Annotate tool. Type information can be any classifier that groups genes. Presently, Pfam numbers, Swissprot keywords and KEGG numbers are the most useful type information which DRAGON provides. The Instructions on the DRAGON Families page will guide you through the process of entering data. View the example data sets if you have questions about what your data should look like. NOTE: Unlike the Annotate tool, it is best to use comma-delimited text files with the DRAGON View tools.
     The biggest confusion with DRAGON Families comes with the output. What exactly does it mean? Here is a diagram and brief explanation.

     Each box in the diagram above represents one gene (as defined by the Unigene database). If you click on any box, you will be hyperlinked to the Unigene cluster which corresponds to that gene. The color of each box represents the expression of each of your genes.

Dark Red = 0 < X < 2.0
Bright Red = X >= 2.0

Dark Green = 0 > X > -2.0
Bright Green = X <= -2.0

     The black text after each row of boxes indicates the type shared by all of the genes in that group. It is important to realize that the boxes on any given row are ALL OF THE GENES IN YOUR DATA THAT HAVE THAT TYPE. Therefore, large numbers of genes which are all in the same group and are all up- or down-regulated are potentially interesting. (We are currently implementing statistical tests for the significance of this type of result; these will be available as part of your output shortly.) If you have correctly indicated the type information you are using while inputting your data, then clicking on the hyperlink should take you to the proper description of that type. Finally, the blue number in parentheses is the average ratio expression value for all of the genes in that group.
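The color legend and the group average described above can be expressed as two small functions. This is a sketch of the rules as stated on this page, not DRAGON's own code. The legend does not say how a value of exactly 0 is drawn, so the "uncolored" result is an assumption, and the sample values are invented.

```python
def box_color(ratio):
    """Map a signed ratio expression value to the legend's color names."""
    if ratio >= 2.0:
        return "bright red"
    if ratio > 0:
        return "dark red"
    if ratio <= -2.0:
        return "bright green"
    if ratio < 0:
        return "dark green"
    return "uncolored"  # assumption: the legend does not cover exactly 0

def group_average(ratios):
    """Average ratio expression value for all genes in one group."""
    return sum(ratios) / len(ratios)

group = [2.5, 1.5, -0.5, -2.5]  # invented ratio values for one type
print([box_color(r) for r in group])
print(group_average(group))
```

Because up- and down-regulated genes cancel in a plain average, a group average near zero can hide a group with strong changes in both directions, so it is worth reading the average alongside the box colors.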

     DRAGON Order is another information visualization tool which attempts to get at the same question addressed by DRAGON Families but from a different angle. The main difference between the two tools is that DRAGON Order automatically pre-sorts data by ratio expression values. Here is a diagram and description of the output.

   +                                   -

     Each row of yellow lines in the picture above represents the entire list of genes which you entered. Each row is defined by a type written in white letters to the right. Each yellow line in a row marks the position of a gene which belongs to the type that defines that row. So, for example, the first row in the picture above is "Transmembrane." The "transmembrane" keyword is a rather broad category, so there are a large number of yellow lines in this row. Wherever there is a yellow line in the row, a gene which encodes a protein with a transmembrane domain is present. The key is that, because your list of genes is sorted by ratio expression value, the position of each yellow line is indicative of the expression level of that gene (the + (up-regulated) and - (down-regulated) signs at the top of the picture indicate the expression levels across the data). Therefore, an even distribution of yellow lines across the whole row means that there is no apparent co-expression of a set of genes in that group. However, a cluster of yellow lines at either the far left or the far right of any given row is interesting because it means that a set of related genes are all up- or down-regulated. For example, four of the five "Cell Adhesion" genes are clustered to the left of the row.
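The layout described above can be imitated in plain text: sort the genes by ratio, then draw one row per type with a mark at the position of each gene of that type. This is a toy rendering under stated assumptions (invented data, a character strip instead of an image) and not the DRAGON Order implementation.

```python
def order_strips(genes):
    """Render one text row per type; '|' marks a gene of that type.

    `genes` is a list of (ratio, set_of_types). Genes are sorted from most
    up-regulated (left, '+') to most down-regulated (right, '-').
    """
    ranked = sorted(genes, key=lambda g: g[0], reverse=True)
    all_types = sorted({t for _, types in ranked for t in types})
    return {
        t: "".join("|" if t in types else "." for _, types in ranked)
        for t in all_types
    }

# Invented data: ratio and keyword set for six genes.
genes = [
    (-1.2, {"Transmembrane"}),
    (3.1, {"Cell adhesion", "Transmembrane"}),
    (2.4, {"Cell adhesion"}),
    (0.4, {"Transmembrane"}),
    (-2.8, set()),
    (1.9, {"Cell adhesion"}),
]
for type_name, strip in order_strips(genes).items():
    print(strip, type_name)
```

In this toy data the "Cell adhesion" marks bunch up at the left (up-regulated) end, while the "Transmembrane" marks are spread along the row.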

     The concept of DRAGON Paths is relatively straightforward. DRAGON Paths uses diagrams of cellular pathways downloaded directly from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database in order to map the user's gene expression values onto these cellular pathway diagrams. The idea is that by viewing the expression levels derived from microarray data within the context of cellular pathways, the user might be able to detect patterns of expression that might not otherwise be apparent. Conveniently, KEGG provides a coordinate file for every single cellular pathway diagram in their database. The coordinate files (you can view these files by following the link on our home page to the *_gene_coord files at the KEGG ftp site) provide x,y coordinates on each diagram for every protein product in the diagram, along with the Locuslink number of the gene corresponding to each protein product. DRAGON Paths simply uses the Locuslink numbers provided by the user to match the expression value for each of the user's Locuslink numbers with the coordinate location of that Locuslink number on any of the chosen KEGG cellular pathway diagrams. Here is an example of a DRAGON Paths output.

     Each green box in the diagram represents a human protein (see the KEGG web site for more information on the configuration of the KEGG pathway diagrams and why some boxes are green and some boxes are white). The numbers in the boxes are the EC numbers for the proteins. Red or green circles are placed in the upper left corner of each protein that is found in your data. The color of the circle is indicative of expression level. Red means up-regulated, green means down-regulated. Each green box is hyperlinked to the Locuslink entry for that protein.
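The matching step behind this output can be sketched as a lookup keyed on the Locuslink number. The coordinates and Locuslink numbers below are invented, and the dictionaries stand in for a parsed KEGG coordinate file whose real format is not reproduced here. Skipping a ratio of exactly 0 is also an assumption.

```python
def place_markers(coords, expression):
    """Pair each LocusLink number found in both inputs with a marker.

    `coords` maps LocusLink number -> (x, y) on a pathway diagram.
    `expression` maps LocusLink number -> signed ratio expression value.
    """
    markers = []
    for locuslink, ratio in sorted(expression.items()):
        if locuslink not in coords or ratio == 0:
            continue  # gene not on this diagram, or no change to show
        color = "red" if ratio > 0 else "green"  # red = up, green = down
        markers.append((locuslink, coords[locuslink], color))
    return markers

# Invented LocusLink numbers and coordinates, for illustration only.
coords = {101: (40, 60), 102: (120, 60), 103: (200, 140)}
expression = {101: 2.3, 103: -1.7, 999: 4.0}
print(place_markers(coords, expression))
```

Genes in the user's data that do not appear on the chosen diagram (999 here) are simply left out, and proteins on the diagram with no matching expression value (102 here) get no circle.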


General Remarks
It is usually best to only request one type of information from DRAGON at a time. For example, if you want to know about KEGG #'s, Pfam #'s and SWISS-PROT Keywords for your gene list, generate three different queries each requesting one of the three different types of information.

Annotate Page
1) If one of the characteristics you choose to associate with a list of Genbank numbers doesn't exist for some of the Genbank numbers on that list, you may find that all of the rest of the data for that gene is also missing. This is a programming error that we are currently addressing.
2) You may find that one of your Genbank numbers is being improperly associated with other sorts of information, such as the wrong Swissprot and Pfam numbers. This is usually due to the existence of certain Genbank numbers that refer to very large pieces of DNA such as BAC and YAC sequences. BAC sequences, for example, can contain many genes, and therefore that Genbank number can be clustered with more than one Unigene Cluster ID. Since DRAGON uses Unigene Cluster ID's, your Genbank number can be associated with incorrect information because of these Genbank numbers for large sequences. We are currently correcting this problem in the online version of DRAGON.
3) You may get strange errors in the output you receive from DRAGON. For example, you may get nothing back when you know that there should be some output, or the names of the genes may be put in the "Genbank Number" column with nothing else output. If so, try inputting only a column of Genbank numbers and a column of expression values and nothing else. White spaces and other characters in columns containing names, for example, can confuse the DRAGON search engine. In fact, if you are getting any sort of strange error, try inputting just two columns: Genbank numbers and expression data. This is a programming error that we are currently addressing.
4) If you ask for just the Unigene Cluster ID for a list of Genbank numbers, you may get nothing in return. Try asking for some other data as well, such as the Locuslink number and you should now get back both the Unigene number and the Locuslink number. This is a programming error that we are currently addressing.
5) Any data set that contains over ~1000 GenBank numbers is going to take a while to annotate, and the email option should be used as the output option for data sets of this size.
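The workaround in item 3 (submit only the GenBank column and the expression column) can be prepared with a few lines of code. The helper below is hypothetical and assumes a tab-delimited file with 1-based column numbers, as on the Annotate page.

```python
def keep_two_columns(text, genbank_col, expression_col, delimiter="\t"):
    """Keep only the GenBank and expression columns (1-based) of each row."""
    cleaned = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = line.split(delimiter)
        cleaned.append(delimiter.join(
            [cells[genbank_col - 1].strip(), cells[expression_col - 1].strip()]
        ))
    return "\n".join(cleaned) + "\n"

# Rows taken from the example table on this page, with tabs restored.
raw = "AL036211\t2.8\t2423\t800\tlumican\nNM_000700\t2.3\t795\t320\tannexin A1\n"
print(keep_two_columns(raw, 1, 2), end="")
```

After this step the GenBank numbers are in column 1, so that is the number to enter in the "Column number containing GenBank numbers" text box.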

Search Page
1) Your searches are taking a really long time. Sometimes they take so long that your browser times out and says that there was no data returned from the server. This is an indexing problem in the database that is currently being addressed.
2) You can't search for characteristics across databases, such as things on chromosome 10 (Unigene) that have an EF-hand domain (Pfam). At present, you can only search within individual databases.
3) You got a blank screen as a result. You may not have checked the proper radio button for the database you would like to search. For example, if you want to search for something in Swissprot and you enter information into one of the fields (e.g. the Keyword field), you also have to click the radio button to the left of the name "Swissprot" at the top left of the Swissprot section of the table.

Frequently Asked Questions

Question: Can I use other sorts of accession numbers, such as Tigr #'s or SWISS-PROT #'s as my entry accession # instead of GenBank #'s on the Annotate page?
Answer: No, currently only GenBank numbers can be used as entry accession numbers on the Annotate page.

Question: Where do you get the information that is present in DRAGON?
Answer: The information present in DRAGON is extracted from a specific set of flat-files that are provided by the public databases integrated in DRAGON. You can view each of these flat-files by going to the DRAGON home page and looking in the bottom right hand section. A list of every single flat-file used in DRAGON and a link directly to it is provided.

Question: How often do you update DRAGON?
Answer: The DRAGON database is updated weekly (Sunday morning).

Question: Can I have a copy of the Perl scripts used to implement DRAGON?
Answer: Not presently. However, all of the data used in every DRAGON tool (Annotate, Search, etc.) is available for download via this web site.

References for DRAGON View

Bassett,D.E., Eisen,M.B., Boguski,M.S. (1999) Gene expression informatics-it's all in your mine. Nat. Genet. Suppl., 21, 51-55.

Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45-48.

Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L., Sonnhammer,E.L.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263-266.

Bouton,C.M. and Pevsner,J. (2000) DRAGON: Database Referencing of Array Genes Online. Bioinformatics, 16, 1038-1039.

Duggan,D.J., Bittner,M., Chen,Y., Meltzer,P., Trent,J.M. (1999) Expression profiling using cDNA microarrays. Nat. Genet. Suppl., 21, 10-14.

Eisen,M.B., Spellman,P.T., Brown,P.O., Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863-14868.

Kanehisa,M. and Goto,S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28, 27-30.

Velculescu,V.E., Zhang,L., Vogelstein,B., Kinzler,K.W. (1995) Serial analysis of gene expression. Science, 270, 484-487.

Zhang,M.Q. (1999) Large-scale gene expression data analysis: a new challenge to computational biologists. Genome Res., 9, 681-688.

Copyright 2001 Kennedy Krieger Institute