Microarray Analysis Tutorial
DNA microarray experiments have emerged as one of the most popular tools for
the large-scale analysis of gene expression. Microarray experiments typically
involve the measurement of the expression levels of many thousands of genes
in only a few biological samples. Often, there are few technical replicates
(i.e. measuring gene expression with the same starting material on independent
arrays), usually because of the relatively high cost of performing microarray
experiments. There are also few biological replicates (e.g. measuring gene expression
from multiple cell lines, each of which has been given an experimental treatment
or a control treatment) relative to the large number of genes represented on
the microarray. The challenge to the biologist is to apply appropriate statistical
techniques to determine which changes are relevant. There is unlikely to be
a single best approach to microarray data analysis, and the tools applied to
microarray data analysis are evolving rapidly. We will describe microarray data
analysis in three parts.
- Data preprocessing
- Inferential statistics
- Descriptive (exploratory) statistics
First, data are normalized and preprocessed. This is essential to allow data sets from two (or more) samples to be compared to each other. Second, inferential statistics are applied. This is also called hypothesis testing, and it allows us to make statements about the likelihood that particular genes are significantly regulated. Third, exploratory statistics (also called descriptive statistics) are applied. This set of approaches includes clustering and principal components analysis, and is used to inspect the complex data set for biologically meaningful patterns. In some microarray studies classification is applied in order to diagnose physiological states (e.g. cancerous versus control cells) based on gene expression profiles. There have been several excellent reviews of microarray data analysis (Quackenbush et al., 2001; Sherlock, 2001; Dopazo et al., 2001; Brazma and Vilo, 2000).
Regardless of the microarray platform that is used, we may begin data analysis
by creating a matrix of genes (arranged in rows) and samples (arranged in columns) (Figure
7-1; see book). For microarray technologies
that rely on competitive hybridization of two samples on an array, the gene expression
values are ratios or relative intensities. Often, these are ratios of intensity
values for the Cy3 (green) dye and the Cy5 (red) dye. The intensity of each signal
is assumed to be directly proportional to the abundance of mRNA for each gene.
In another commonly used scenario, a single sample is hybridized to a microarray.
This is the case for many platforms using radioactively labeled cDNA or for platforms
such as Affymetrix using oligonucleotides immobilized on a chip. In this case
absolute values will be obtained for two (or more) experimental conditions. These
absolute values can be divided for each gene to obtain ratio values.
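As a rough illustration of dividing absolute values to obtain ratios, the sketch below builds a tiny genes-by-samples table and computes one ratio per gene. The gene names, the condition names ("control", "treated"), the numbers, and the function name expression_ratios are all invented for this example and are not taken from any data set.

```python
# Hypothetical absolute expression values for three genes measured in two
# conditions (gene names and numbers are invented for illustration).
expression = {
    "geneA": {"control": 200.0, "treated": 800.0},
    "geneB": {"control": 500.0, "treated": 500.0},
    "geneC": {"control": 900.0, "treated": 300.0},
}


def expression_ratios(matrix, numerator, denominator):
    """Return {gene: numerator value / denominator value} for every gene."""
    return {
        gene: values[numerator] / values[denominator]
        for gene, values in matrix.items()
    }


ratios = expression_ratios(expression, "treated", "control")
for gene, ratio in ratios.items():
    print(f"{gene}: treated/control ratio = {ratio:.2f}")
```

A ratio above 1 means a higher value in the numerator condition, and a ratio below 1 means a lower value.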
Step 1. Accessing microarray data
Follow these instructions to access a database (or use the direct link in the fifth item below):
- Go to the Stanford Microarray Database at
http://www.dnachip.org.
- Click on published data.
- Select the Web Supplement for: Chu et al. (1998), The transcriptional program of sporulation in budding yeast (Science 282:699-705).
- Select: 'Additional Figures and Complete Data Set.'
- Select: 'Spo Spreadsheet.' (direct link) A text-only spreadsheet will open in a new browser window.
- Save the file to the hard drive of your computer.
- Start Microsoft Excel and open the text file that you just saved.
Column A contains names for each of the genes on the microarrays. Columns B and D contain fluorescence intensity values for green and red labeled samples, respectively, at time t = 0. Columns C and E contain background intensity values for the spots on the array.
Note: We will use Microsoft Excel. I also highly recommend S-PLUS, a sophisticated statistical analysis package.
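For readers who prefer a script to a spreadsheet, the same tab-delimited text file can in principle be read programmatically. The sketch below is only an assumption about the layout: the header name "gene" and the two data rows are invented, and the real file may use different header text. The four intensity column names follow the labels used later in this tutorial.

```python
import csv
import io

# Invented stand-in for the first lines of the saved text file.
SAMPLE_TEXT = (
    "gene\tt0 green\tt0 green bkg\tt0 red\tt0 red bkg\n"
    "geneA\t3570\t120\t4100\t150\n"
    "geneB\t980\t110\t450\t140\n"
)

INTENSITY_COLUMNS = ("t0 green", "t0 green bkg", "t0 red", "t0 red bkg")


def read_spots(handle):
    """Read tab-delimited rows into a list of dicts with numeric intensities."""
    reader = csv.DictReader(handle, delimiter="\t")
    spots = []
    for row in reader:
        spot = {"gene": row["gene"]}
        for column in INTENSITY_COLUMNS:
            spot[column] = float(row[column])
        spots.append(spot)
    return spots


spots = read_spots(io.StringIO(SAMPLE_TEXT))
print(spots[0])
```

To read a saved file instead of the sample string, pass an open file handle, for example open("your_file.txt", newline=""), where the file name is whatever you chose when saving.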
Step 2. Creating an Excel graph without correcting for background fluorescence
In this portion of the analysis, we will ignore the background fluorescence measurements (columns C and E) and just work with the spot intensity values at time t = 0. First, raw intensities will be used, then the base 10 logarithm of each intensity will be used.
Discussion question: what are the advantages of using logs?
Step 3. Making a graph of the raw intensities
- Click on the 'B' at the top of column B to highlight the entire column.
- Hold down the 'Ctrl' key and click on the 'D' at the top of column D. This will select column D while keeping B selected (but the column in between them will not be selected.)
- Click on 'Insert' at the top of the page and choose 'Chart.'
- Select 'XY (Scatter)' and then click 'Next'. A graph should appear, containing red intensity values on the y-axis and green intensity values on the x-axis. This graph is similar to the one shown in Figure 7-3, panel B.
- Click 'Next'. This box allows you to choose a title and axis labels for your graph.
- Click 'Next' again. Here, you can decide whether to save the graph as its own sheet within the file or as an object that can be moved around on top of the spreadsheet. If you choose to save it as its own sheet, you can use the tabs at the bottom to toggle between the spreadsheet and the graph.
- Click 'Finish' after selecting 'new sheet' or 'object.'
Note: When creating a graph, the first column that you highlight will be used as the x-axis and the second column will be the y-axis.
Step 4. Making a graph of log intensities
Insert a new column in the spreadsheet after column B.
- Click on the 'C' above column C.
- Click 'Insert' at the top of the page and choose 'Columns'. In this column, you will calculate the base 10 logarithm of each of the intensity values in column B (t0 green).
- Click on the second box in the new column (now column C).
- Move the cursor above the columns to the equal sign (=) in front of the blank line. The words 'Edit Formula' should appear in a beige box as you mouse over the = sign.
- Click once on the = sign.
- Then enter 'LOG10(B2)' on the blank line and press 'enter'. The number '3.552668' should appear in box C2.
- Now copy this formula to the rest of the boxes in column C:
- Click once on box C2.
- Then go up to the top of the screen and click on 'Edit' and choose 'Copy'.
- Highlight all of the remaining boxes in column C.
- Then click on 'Edit', then 'Paste.'
- Repeat these steps with the t0 red values.
- Make a scatter plot of the log columns, as shown in
Figure 7-3, panel C of the textbook.
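The log transformation in this step is the same operation as applying math.log10 to each value of a column. In the minimal sketch below, the intensities are invented. The first value, 3570, was chosen only because its base 10 logarithm rounds to the 3.552668 quoted above; whether that is the exact number in box B2 of the real file is an assumption.

```python
import math


def log10_column(intensities):
    """Return the base 10 logarithm of each intensity in the list."""
    return [math.log10(value) for value in intensities]


green = [3570.0, 980.0, 12500.0]  # invented raw intensities
log_green = log10_column(green)
print([round(value, 6) for value in log_green])
```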
Step 5. Creating an Excel graph taking background fluorescence into account
In this portion of the analysis, incorporate the information on background fluorescence measurements (originally columns C and E, but now columns D and G if you inserted two columns in Step 4 above). The 't0 green bkg' and 't0 red bkg' columns give measurements of fluorescence from the local area around each spot on the filter, and thus give an indication of non-specific binding of labeled RNA. We would like to exclude this from the intensity of each spot, as non-specific binding may differ across the filter or may differ between the green and red labels.
- Insert a column following column D ('t0 green bkg').
- Use the = sign ('Edit Formula') button to subtract 't0 green bkg' from 't0 green.' Make sure that you subtract the raw 't0 green' values, not the log values. Repeat for 't0 red bkg' and 't0 red.'
- Create two additional columns and calculate log intensities for the background-adjusted intensities that you just created.
- Make a graph using these new adjusted log10 values. If background intensities were uniform across the array and between the red and green values, this graph should not be much different from the previous graph (see Figure 7-3, panel C).
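One practical detail of this step is that a spot whose background is at least as large as its signal gives a zero or negative difference, and the logarithm of such a value is undefined (Excel reports an error for LOG10 of a non-positive number). The sketch below shows one possible way to handle this, returning None for those spots. That choice, the function name, and the numbers are assumptions made for illustration, not part of the tutorial's procedure.

```python
import math


def background_corrected_log10(signal, background):
    """Return log10(signal - background), or None if the difference is not positive."""
    corrected = signal - background
    if corrected <= 0:
        return None  # the logarithm is undefined for zero or negative values
    return math.log10(corrected)


# Invented (signal, background) pairs for three spots.
spots = [(1100.0, 100.0), (250.0, 150.0), (90.0, 120.0)]
results = [background_corrected_log10(s, b) for s, b in spots]
print(results)
```

Spots that return None would simply be left off the log-scale graph.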
Step 6. Normalizing to total fluorescence
Because there may have been more total mRNA in one sample than in the other, it is important to normalize each individual intensity value to the average fluorescence in that sample.
- Insert one new column in which you will calculate normalized raw minus background values for green data and one for red data. (Note: we will be normalizing the values in the raw minus background columns, not the log values that you used for the last graph. If you have done all of the steps so far, these should be columns E and J.)
- Find the average (mean) value for each of the raw minus background columns.
- To do this, click on the space below the last value in the column (e.g. box E6120).
- Click the = sign and type in 'AVERAGE(E2:E6119)'. When you hit 'enter', the mean of the entire column will appear in the box.
- You will then divide each of the values in the raw minus background column by this mean. To do this, click on box 2 in your new column, click the = sign and type in 'E2/xxxx', where xxxx is the numerical value (not the box location on the spreadsheet) of the mean of the raw minus background column.
- Copy and paste this formula into the rest of the column. Repeat for the red data.
- Make two new columns, one for green and one for red. In these columns calculate the log of the normalized raw minus background values (i.e., the columns that you made in the previous step with (raw minus background) divided by (mean of raw minus background)).
- Make a graph with these new columns.
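The normalization in this step divides every background-subtracted value in a channel by the mean of that channel, so each normalized column has a mean of 1. A minimal sketch with invented numbers and a hypothetical function name:

```python
from statistics import mean


def normalize_to_mean(values):
    """Divide every value by the mean of the whole column."""
    column_mean = mean(values)
    return [value / column_mean for value in values]


green_corrected = [100.0, 200.0, 300.0]  # invented raw minus background values
green_normalized = normalize_to_mean(green_corrected)
print(green_normalized)
```

Applying the same function separately to the red column puts both channels on a comparable scale before the log is taken.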
Step 7. Plotting log ratio of intensities vs. mean log intensity
It is useful to look at the ratio of intensities as a function of mean log intensity because there may be more variation in expression levels of genes that are expressed at a low level than genes whose transcripts are abundant (or vice versa). This plot is also helpful in detecting curvature in the cloud of data points.
Create three new columns in the spreadsheet. In one column, calculate, for each gene, the mean of the last two columns that you created in Step 6. These numbers will represent the green-red average of the log of the normalized background-subtracted intensities for each gene. This column will be the x-axis of the graph.
In another column, divide each value in the red normalized raw minus background column by each value in the green normalized raw minus background column. DO NOT use the log normalized columns (i.e., the last columns that you created in Step 6) but use the columns that you created before taking the log in Step 6.
In the third column, take the log of the values in the column that you just created. So, this column will be the log of the ratio of normalized background-subtracted intensities. Use this column as the y-axis of this graph.
The graph will be similar to the ones in Figures 7-4 and 7-5. Mouse over any outliers to learn their identities.
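The three columns described in this step reduce to two numbers per gene: the mean of the two log intensities (x-axis) and the log of the red/green ratio (y-axis). The sketch below computes both from normalized, background-subtracted values. The input numbers and the function name are invented, and all inputs are assumed to be positive.

```python
import math


def log_ratio_vs_mean_log(green_normalized, red_normalized):
    """Return one (mean log10 intensity, log10 red/green ratio) pair per gene."""
    points = []
    for green, red in zip(green_normalized, red_normalized):
        mean_log = (math.log10(green) + math.log10(red)) / 2  # x-axis value
        log_ratio = math.log10(red / green)                   # y-axis value
        points.append((mean_log, log_ratio))
    return points


green = [1.0, 10.0]   # invented normalized green values
red = [100.0, 10.0]   # invented normalized red values
points = log_ratio_vs_mean_log(green, red)
for x, y in points:
    print(f"mean log intensity = {x:.3f}, log ratio = {y:.3f}")
```

A gene with equal red and green values lands on the horizontal line y = 0, so systematic departures from that line are easy to spot.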
When a microarray experiment is completed and the data arrive, the first question most investigators ask is: "Which genes were most dramatically up- or down-regulated in my experiment?" This can be answered using inferential statistics, a branch of data analysis in which probabilities are assigned to the likelihood that a gene is significantly regulated.
- A spreadsheet listing all the genes represented on the array and all the expression values can be sorted to show the most differentially regulated genes.
- A scatter plot (see below) can help to quickly profile the behavior of the most regulated genes.
- A t-test can be used to describe the probability that a gene is regulated.
We will use Significance Analysis of Microarrays (SAM) and Partek software to perform t tests. SAM is a Microsoft Excel plug-in.
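To give a feel for what a per-gene t test computes, the sketch below calculates Welch's t statistic (the unequal-variance form of the two-sample t test) and its approximate degrees of freedom for one gene. This is a generic textbook calculation and is not how SAM or Partek work internally; those packages apply their own procedures. The replicate values and the function name are invented, and converting the statistic to a p value would additionally require the t distribution, which is not included here.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    # Sample variance of each group divided by its size.
    term_a = variance(group_a) / n_a
    term_b = variance(group_b) / n_b
    standard_error_squared = term_a + term_b
    t_statistic = (mean(group_a) - mean(group_b)) / math.sqrt(standard_error_squared)
    degrees_of_freedom = standard_error_squared ** 2 / (
        term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1)
    )
    return t_statistic, degrees_of_freedom


# Invented log expression values for one gene in three replicates per condition.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_statistic, degrees_of_freedom = welch_t(control, treated)
print(f"t = {t_statistic:.3f}, df = {degrees_of_freedom:.1f}")
```

With only a few replicates per gene and thousands of genes tested, many large t values arise by chance, which is why dedicated tools also address multiple testing.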
A fundamental question that may be asked of microarray data is: "What signatures (or patterns or profiles) of gene expression can be found in all the gene expression values obtained in this experiment?" This type of question is addressed using descriptive statistics or exploratory analysis. Clustering trees can show the relationships between samples (such as normal versus diseased cells), between genes, or both. Other tools for the analysis of gene expression include principal components analysis, multidimensional scaling, and self-organizing maps. We will consider all these tools for the analysis of array data.
- Principal components analysis (PCA) with Partek.