Chapter: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | App 1 | App 2
Outline summary
TOPIC 1: INTRODUCTION TO GCG
TOPIC 2: GENERAL COMMANDS
TOPIC 3: EDITING SEQUENCES
TOPIC 4: PAIRWISE ALIGNMENT
TOPIC 5: MULTIPLE SEQUENCE ALIGNMENT
TOPIC 6: PHYLOGENY
TOPIC 7: SEQUENCE ANALYSIS
One-page summary sheet of helpful commands
|
General commands (UNIX stuff): |
|
|---|---|
| cat x | shows the document x |
| cat | more | shows the document, one page at a time (spacebar to advance a page) |
| ls | lists the files in a directory |
| ls *.pep | lists all the files in a directory that end with the suffix .pep |
| ^c | interrupt what you’re doing and return to the command line |
| ^s | pause the output on the screen |
| ^q | resume the output |
| General commands (GCG): | |
|---|---|
| mkdir | create a new directory |
| mkdir a.dir | create a new directory named a.dir |
| down a.dir | move from the home directory down to a.dir |
| over b.dir | move over to the directory called b.dir |
| genhelp | get help on all the 120 programs within GCG |
| genhelp gap | get help on the program called gap |
| rm x | delete a file called x |
| rm *z* | delete all the files with a z in the name |
| -che | check parameters |
| Entering and editing sequences: | |
|---|---|
| seqed x.pep | make (or edit) a protein file called x |
| seqed y.seq |
make (or edit) a DNA file called y |
| exit | end a seqed session and save your changes ^D |
| quit | end a seqed session without saving changes |
| include | paste in portion of another DNA sequence |
| help | get help on sequed |
| onecase | make sequences all upper or lowercase |
| onecase *.pep | make all protein sequences one case |
| Comparisons | |
|---|---|
| bestfit | compare two proteins or DNA sequences; focus on a local alignment |
| gap | compare two proteins or DNA sequences; focus on a global alignment |
| tfasta | compares one protein to a DNA query translated in all 6 possible frames |
| pileup | compare up to hundreds of sequences in a multiple sequence alignment |
| Sequence analysis | |
|---|---|
| prime | design oligonucleotide primers |
| map | make a map of a DNA sequence; show restriction enzymes |
| map -six | make a map showing restriction enzyme six-cutters |
| motifs | x.pep scan the prosite dictionary for motifs and domains present in protein x |
The GCG package is a collection of >130 interrelated software programs used to analyze protein, RNA, and DNA sequences. It is a commercial product that is similar in scope to Vector NTI (by Informax) and to the freely available, UNIX-based tools of Emboss (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/).
“GCG® (Genetics Computer Group) serves molecular biologists by building practical tools that implement the most important techniques of mathematical biology. Founded in the Department of Genetics at the University of Wisconsin-Madison in 1982, GCG continued as a service of the UW Biotechnology Center from 1985 to 1989. Now a subsidiary of Oxford Molecular, GCG remains committed to the development and support of sequence analysis software as well as relational database management tools for storing, mining, and visualizing sequence data.
GCG's flagship product, the Wisconsin Package is the industry standard for sequence analysis, containing over 130 interrelated software programs to analyze nucleic acid and protein sequences. It is used by more than 40,000 scientists at over 650 institutions around the world. Since the product's inception, scientists from all over the world have collaborated to develop and refine the Wisconsin Package.”
| GCG topics include: | |
|---|---|
| * Comparison | * Importing and Exporting |
| * Database Searching | * Mapping |
| * DNA/RNA Secondary Structure | * Primer Selection |
| * Editing and Publication | * Protein Analysis |
| * Evolution | * SeqLab |
| * Fragment Assembly | * Translation |
| * Gene Finding and Pattern Recognition | * Utilities |
General commands (UNIX)
When you use the GCG suite of programs you are in a UNIX environment. Some useful
commands follow:
| > cat x | shows the document x |
| > more x | shows the document x, one page at a time (press spacebar to advance) |
| > ls | lists the files in a directory |
| > ls *.pep | lists all the files in a directory that end with the suffix .pep |
| > ^C | interrupt what you’re doing and return to the command line |
| > ^S | pause the output on the screen |
| > ^Q | resume the output |
| > man | on-line help manual |
Within the world of GCG, there are several important commands to know.
Getting help. For any program you’re working on, it’s very useful to read through the help menu. Typically, this menu provides features such as
• an overview of what the program does
• an example of how to actually use it
• a description of the algorithm used, together with literature references
• a comparison of what this program does to what related GCG programs can do
> genhelp get help on all the 130 programs within GCG
> genhelp gap get help on the program called gap
Organizing your files into directories. One can quickly accumulate dozens (or
hundreds) of GCG files. A good way to organize your files is to make a separate
directory for each project you are working on. When you want to make a directory
simply type
mkdir
followed by the name fo the new directory. Then to get to that directory in
order to work in it, type “down a.dir.” You can move between directories
laterally by typing “over any.dir.” To get back to your main, starting
directory you can always type “home.” And if you want to confirm
where you are at any time, you can always type “ls” to list whatever
files you have and to see the name of your current directory or subdirectory.
| > mkdir | create a new directory home |
| > mkdir a.dir | create a new directory called a.dir |
| > down a.dir | move from the home directory down to a.dir |
| > over b.dir | move over to the directory called b.dir |
| > up | move up to a higher directory |
| > home | go to your main, home directory |
Naming files.
A handy rule is to name any protein file you work with something with the suffix
“.pep” (for peptide; for example, actin.pep). Name any DNA file
with the suffix “.seq” (for sequence; for example, actin.seq). You’ll
typically get the chance to name a file when you first create it in the main
sequence editor, seqed (see below). The GCG programs like to see names such
as a.pep or a.seq so they know that the appropriate data are being entered into
a program (e.g. they want .seq files for restriction enzyme analyses, or .pep
files for hydropathy profiles). There are some exceptions; for example, a DNA
sequence that has been flipped to show you bottom strand sequence has the suffix
something.rev.
Deleting, renaming, and copying files.
When you are deleting files, be very careful! It’s easy to delete, and
there’s no turning back. Type > rm x to remove a file called x. (rm
stands for remove)
| > rm x | delete a file called x |
| > rm *z* | delete all the files with a z in the name |
| > rm *.pep | delete all your peptide/protein files with the suffix .pep |
To rename a file, use the mv (move) command.
> mv somename.pep bettername.pep will rename a file ‘somename.pep’
as a file called ‘bettername.pep’
To copy a file, use the cp (copy) command.
> cp *.pep ~/somedirectory/ will copy all your files containing the suffix
.pep to your a directory called “somedirectory”
Note that the tilda ~ represents your home directory.
The main feature of GCG is its ability to manipulate sequences. In most cases,
you need to begin by importing some sequence into GCG. Typical examples of what
you’d import include the following:
• you get some DNA sequence from the Core Facility on your computer, and
you want to know how well it matches a known DNA sequence or a known protein
sequence.
• you directly sequence a protein and want to analyze its properties.
• someone sends you a recombinant fusion protein and you want to know
its properties.
• you need to know if two proteins are significantly homologous or not.
In many cases, you will begin by finding a protein or DNA sequence through the National Center for Biotechnology Information (NCBI) or SWISS-PROT (Chapter 2).
Copy your DNA or protein sequence, then paste it into GCG using the sequence
editor seqed.
> seqed x.pep make (or edit) a protein file called x.pep
> seqed y.seq make (or edit) a DNA file called y.seq
You find yourself now in the world of seqed. On the page, there are three levels:
a header up top, a sequence buffer in the middle, and a command line editor
down below. Type ^D (control D) to move down to the sequence buffer, then “paste”
in your sequence. Once this
: exit end a seqed session and save your changes
: quit end a seqed session without saving changes
: include paste in portion of another DNA sequence
: help get help on seqed

Here is an example of how to enter human retinol-binding protein (RBP) into GCG.
In GCG, type the following (hsrbp stands for Homo sapiens retinol-binding protein):
> seqed hsrbp.pep <return>
Note that the GCG cursor is now in the top header region:

In Netscape, open the GenBank protein record for RBP. Copy the accession number and any other relevant information into the header region. When you are finished, press ^D to enter the main sequence buffer. Copy and paste the amino acid sequence of RBP from Netscape into GCG (note that GCG will ignore numbers that are associated with the letters of protein or nucleic acid sequences):
You have successfully created the file hsrbp.pep.
Type > ls to confirm that this file has appeared in your directory.
Type > cat hsrbp.pep to see this file displayed:
> cat hsrbp.pep
Optionally, you can use the following commands:
> onecase make sequences all upper or lowercase
> onecase *.pep make all protein sequences one case
In this handout, we will also use the example of syntaxins. These are proteins that bind vesicle proteins and coordinate the docking of vesicles with the appropriate target membrane.
bestfit compare two proteins or DNA sequences; focus on a local alignment
gap compare two proteins or DNA sequences; focus on a global alignment
tfasta compare one protein to the six proteins potentially encoded from one
DNA sequence
Example of how to use the bestfit program
In this example, compare two proteins (human syntaxin 1a and human syntaxin
7) by typing “bestfit” then then names of the proteins to be compared
(with a space between them).

What do you need to know about this output?
• the alignment uses the one letter amino acid code; memorize it!
• the lines indicate amino acid identities
• the dots indicate similarity; this is much less informative than identity
• the gap weight and gap length penalties are parameters you can adjust
to optimize the alignment you are performing. For example, if you believe that
two proteins are distantly related but they don’t align well, try reducing
the gap penalty to see if a less stringent alignment criterion yields a better
Bestfit (at the cost of adding more gaps).
When you do a Bestfit, you can ask the program to measure the statistical significance of a protein comparison.
Type:
> bestfit humsyn1a.pep humsyn7.pep -ran=50

These data of the average quality based on 50 randomizations let you answer the key question, “when I get a quality score from the comparison of my two proteins (Q = 194 in this case), is that score much better than if I had just randomly shuffled one of the proteins 50 times (Q = 35.2 +/- 4.0)?” The answer is yes, because the Z score for the authentic comparison is more than three standard deviations greater than the randomized score. For this example, Z = (194 - 35.2)/ 4.0 = 39.7. If Z < 3, your two proteins are not significantly related.
GCG documentation on the bestfit program (> genhelp bestfit)
BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.
DESCRIPTION
BestFit inserts gaps to obtain the optimal alignment of the
best region of similarity between two sequences, and then
displays the alignment in a format similar to the output from
Gap. The sequences can be of very different lengths and have
only a small segment of similarity between them. You could take
a short RNA sequence, for example, and run it against a whole
mitochondrial genome.
BestFit is the most powerful method in the Wisconsin
Package(TM) for identifying the best region of similarity
between two sequences whose relationship is unknown.
Example of how to use the Gap program

GAP (global alignment) versus Bestfit (local alignment; BLAST-like)
Note that while Bestfit and GAP are similar--each allows you to compare two sequences--they do have distinct differences. Bestfit is optimized for local alignments, finding the best region of homology between two proteins (or DNA sequences). GAP is optimized for global alignments, scanning the entire sequences. In the above example of two syntaxins, bestfit showed a better percent amino acid identity (27% over a length of 216 residues), while GAP showed only 24% identity but over a longer span of 292 residues.
GCG documentation on the gap program (> genhelp gap)
Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.
DESCRIPTION
Gap considers all possible alignments and gap positions and creates the alignment with the largest number of matched bases and the fewest gaps. You provide a gap creation penalty and a gap extension penalty in units of matched bases. In other words, Gap must make a profit of gap creation penalty number of matches for each gap it inserts. If you choose a gap extension penalty greater than zero, Gap must, in addition, make a profit for each gap inserted of the length of the gap times the gap extension penalty. Gap uses the alignment method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970).
Example of how to use the TFastA program



When do you use TFastA?
• If you obtain a DNA sequence (e.g. from a BLAST search or from the core
facility following a yeast 2-hybrid result), a TFastA search will tell you its
relation to any given protein. If there are frameshifts due to sequencing errors,
the TFastA search will show this. Using the GCG map program, errors can be corrected.
• TFastA is an important tool for new gene discovery (a process called
“database mining,” “in silico genomics,” or sometimes
“in silico hallucinations”). If you are studying a gene family and
want to find a new member, go to GenBank and conduct a series of tblastn searches
with your proteins of interest against all the databases of expressed sequence
tags. Download all DNA database entries relevant to your protein queries into
GCG. Then perform TFastA searches to assign each EST (i.e. DNA sequence) to
a protein. If an EST does not match any protein in your family, you have identified
a novel gene.
GCG documentation on the PileUp program (> genhelp pileup)
Function of PileUp
PileUp creates a multiple sequence alignment from a group of related sequences
using progressive, pairwise alignments. It can also plot a tree showing the
clustering relationships used to create the alignment.
Description of PileUp
PileUp creates a multiple sequence alignment using a simplification of the progressive
alignment method of Feng and Doolittle (Journal of Molecular Evolution 25; 351-360
(1987)). The method used is similar to the method described by Higgins and Sharp
(CABIOS 5; 151-153 (1989)).
The multiple alignment procedure begins with the pairwise alignment of the two
most similar sequences, producing a cluster of two aligned sequences. This cluster
can then be aligned to the next most related sequence or cluster of aligned
sequences. Two clusters of sequences can be aligned by a simple extension of
the pairwise alignment of two individual sequences. The final alignment is achieved
by a series of progressive, pairwise alignments that include increasingly dissimilar
sequences and clusters, until all sequences have been included in the final
pairwise alignment.
Before alignment, the sequences are first clustered bysimilarity to produce
a dendrogram, or tree representation of clustering relationships. It is this
dendrogram that directs the order of the subsequent pairwise alignments. PileUp
can plot this dendrogram so that you can see the order of the pairwise alignments
that created the final alignment.
As a general rule, PileUp can align up to 500 sequences, with any single sequence
in the final alignment restricted to a maximum length of 7,000 characters (including
gap characters inserted into the sequence by PileUp to create the alignment).
However, if you include long sequences in the alignment, the number of sequences
PileUp can align decreases. See the RESTRICTIONS topic, below, for a more complete
discussion of
sequence number and size limitations.
Example of a PileUp
There are two easy ways to do a PileUp.
(1) On the command line, type > PileUp *.pep
All files with the suffix .pep will be made into the pileup
(2) If your directory has lots of files in it, you can create a special file
called “anyname” that includes the specific proteins (or DNA sequences)
you want to have included.


What is a phylogenetic tree?
A phylogenetic tree is a graphical representation of the evolutionary history of a protein family (or other family such as a gene family). Phylogeny reconstructs the branching pattern of a group of proteins; the topology of a tree shows the history (ancestors) of the protein over time.
Why are trees useful?
A phylogenetic tree provides a visual demonstration of how a group of proteins
are related. Suppose you are studying a particular kinase cloned from a mouse
cDNA library. A tree can help answer the following questions.
• does your kinase have closely related homologs? For example, if you
knock out its gene, are there closely related kinases? If you alter the kinase
activity in cells with a drug, are there other kinases likely to be affected
as well?
• how should you name it?
• has it been relatively conserved throughout evolution, or has its sequence
changed rapidly? This may give clues as to its function.
• what individual amino acid residues are critical for the function of
my kinase?
How do I make a tree?
To make a tree, begin by doing a PileUp of your group of proteins (or nucleic acid sequences). The output of PileUp can be a multiple sequence alignment, with a suffix pileup.msf (multiple sequence format). Alternatively, you can specify the output to be a simple tree by choosing output option B.
We will describe how to make a tree in detail in the week of September 25. The key tree-making programs in GCG are distances, which makes a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions.
In this course we will use PAUP (Phylogeny Analysis Using Parsimony), a superb
tree-building program. PAUP has also been incorporated into GCG. PAUPSearch
reconstructs
phylogenetic trees from a multiple sequence alignment using parsimony, distance,
or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees
output by PAUPSearch and can also plot the trees output by GrowTree.
| prime | design oligonucleotide primers |
| map | make a map of a DNA sequence; show restriction enzymes |
| map -six | make a map showing restriction enzyme six-cutters |
| motifs x.pep | scan the prosite dictionary for motifs and domains present in protein x |
GCG documentation on the Prime program (> genhelp prime)
FUNCTION
Prime selects oligonucleotide primers for a template DNA sequence. The primers may be useful for the polymerase chain reaction (PCR) or for DNA sequencing. You can allow Prime to choose primers from the whole template or limit the choices to a particular set of primers listed in a file.
DESCRIPTION
Prime analyzes a template DNA sequence and chooses primer pairs for the polymerase
chain reaction (PCR) and primers for DNA sequencing. For PCR primer pair selection,
you can choose a target range of the template sequence to be amplified. For
DNA
sequencing primers, you can specify positions on the template that must be included
in the sequencing.
GCG documentation on the Map program (> genhelp map)
FUNCTION
Map maps a DNA sequence and displays both strands of the mapped sequence with restriction enzyme cut points above the sequence and protein translations below. Map can also create a peptide map of an amino acid sequence.
DESCRIPTION
Map displays a sequence that is being assembled or analyzed intensively. Map
asks you to select the enzymes whose restriction sites should be marked individually
by typing their
names. If you do not answer this question, Map selects a representative isoschizomer
from all of the commercially available enzymes. You can choose to have your
sequence
translated in any or all of the six possible translation frames. You can also
choose to have only the open reading frames translated.
After running Map, you may create a new sequence file with the peptide sequence from any frame of DNA translation by using the ExtractPeptide program with the Map output file.
Example of how to use the Map program
Here we input a DNA sequence (human syntaxin 7) and specify that we want the
output to show only enzymes that are six-cutters (158 enzymes including EcoRI,
PvuII, BamHI, etc.). In all, the program can analyze sites for over 500 restriction
enzymes. Note that you can specify an output display ranging from no predicted
proteins to all six potential reading frames.

FUNCTION
Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.
DESCRIPTION
Motifs looks for protein sequence motifs by checking your protein sequence for every sequence pattern R in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Currently, Motifs can only search for patterns in protein sequences. There is a very informative abstract on every motif in the PROSITE Dictionary. These abstracts are displayed next to any motif found in your sequence.
Example of how to use the Motifs program

