GENOME PIXELIZER 2D PLOTTER
PROGRAM DESCRIPTION
GenomePixelizer 2D plotter (or genoPix2D) generates images (actually
interactive canvases) of genomic similarity dot plots, in which each "dot"
indicates similarity between a pair of genes. Diagonal runs of dots
generally indicate collinearity in the genomic regions being compared.
The program can compare large (chromosome-scale or even eukaryotic
genome-scale) genomic regions, and can produce PostScript output of the
dot plots.
One group of genes (on the X-axis) is compared to another group of genes
(on the Y-axis). If there is a match (similarity) above a defined cutoff
value between a pair of genes (Xi,Yj) then a dot is placed on the 2D plot
with corresponding coordinates (Xi,Yj).
     Xi     Xk    Xp
   +------------------> X axis
   | .      .     .
   | .      .     .
Yj |.*      .     .
   |        .     .
   |        .     .
Ym |........*     .
   |              .
Yq |..............*
   V
 Y axis

Figure 1
For example (Figure 1), the gene pairs Xi-Yj, Xk-Ym and Xp-Yq result in
three dots plotted on the 2D graph. If the order of the genes (Xi-Xk-Xp) in
genome X is the same as the order of the genes (Yj-Ym-Yq) in genome Y, then
the plotted dots form a diagonal. If the order of the genes differs between
the two genomes, then the dots will most likely form random patterns.
Thus, by identifying diagonals on 2D dot plots it is possible to identify
'syntenic' regions between two different genomes or two chromosomes within
genomes. 'Synteny' means 'same thread' (or ribbon), a state of being
together in location.
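The plotting rule and the meaning of a diagonal can be sketched in a few
lines of Python. This is an illustration only, not the program's own code
(GenomePixelizer runs on Tcl/Tk); the function names dots_for_pairs and
is_collinear and the toy scores are hypothetical.

```python
def dots_for_pairs(order_x, order_y, hits, cutoff=0.4):
    """Return (i, j) dot positions for gene pairs scoring at or above cutoff.

    order_x, order_y: gene IDs listed in chromosomal order.
    hits: iterable of (gene_x, gene_y, identity) tuples.
    """
    index_x = {gene: i for i, gene in enumerate(order_x)}
    index_y = {gene: j for j, gene in enumerate(order_y)}
    return sorted(
        (index_x[gx], index_y[gy])
        for gx, gy, identity in hits
        if identity >= cutoff and gx in index_x and gy in index_y
    )


def is_collinear(dots):
    """True if Y positions increase together with X positions (a diagonal)."""
    ys = [j for _, j in dots]  # dots are already sorted by X position
    return all(a < b for a, b in zip(ys, ys[1:]))


# Toy data modelled on Figure 1; the last hit falls below the cutoff.
hits = [("Xi", "Yj", 0.9), ("Xk", "Ym", 0.7), ("Xp", "Yq", 0.5), ("Xk", "Yq", 0.2)]
dots = dots_for_pairs(["Xi", "Xk", "Xp"], ["Yj", "Ym", "Yq"], hits)
print(dots, is_collinear(dots))
```

Real data contain many off-diagonal hits, so this strict test is far too
simple for finding syntenic regions; it only shows why conserved gene order
appears as a diagonal.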
The idea of diagonal plots is not new. A search of http://www.google.com/
for the keywords "genome, diagonal, plot" finds thousands of web links to
different programs, published papers and unpublished data on this subject.
What is new in the GenomePixelizer 2D plotter:
1. Images on canvas are not static. They are interactive, and every dot
(element) is associated with corresponding gene annotations that are
searchable by gene ID.
2. A different color scheme can be assigned to every dot (element)
interactively (the "Node painter" function). The "canvas editor" function
allows a user to finish a project by adding custom text labels and simple
graphics.
3. Genes with "hits" (similarity) to other genes, so called "plotted" pairs,
and genes without matching pairs, so called "unplotted" genes, are displayed
in two distinct groups.
4. Zoom-in functionality. By specifying genome coordinates, it is possible
to select regions of interest and display them in a new scale.
5. Scalability. The program can generate canvases and corresponding
PostScript images of very large size (6 feet x 6 feet is not a limit).
6. The program comes with a BLAST parser that generates Matrix files
automatically from BLAST search results.
7. GenomePixelizer 2D plotter is a cross-platform desktop application. It
runs on any computer that has a Tcl/Tk interpreter
(http://tcl.activestate.com).
INPUT FILES
Two input files are required: a "Matrix" file (containing BLAST hit
information), and a "Coords" file (containing gene coordinates). A third
file, with gene annotations, is optional.
The "matrix file" contains identity scores for pairs of genes:
column 1: gene ID in genome A;
column 2: gene ID in genome B;
column 3: identity score for given pair (normalized to [0,1]);
column 4: normalized expectation values;
column 5: alignment length.
Only the first three columns of the Matrix file are required by the
GenomePixelizer 2D plotter. All other columns (expectation and alignment
length) are ignored. However, the expectation values and alignment lengths
are useful for validating the identity score. The matrix file can be
generated automatically from BLAST output using the
tcl_blast_parser_123.tcl script provided in the "Script" directory.
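As a rough sketch of how such a file could be read, the Python below keeps
only the first three columns. It assumes that columns are separated by
whitespace (the delimiter is not stated here); the function names and the
sample lines are hypothetical.

```python
def parse_matrix_line(line):
    """Return (gene_a, gene_b, identity) from one Matrix line, or None."""
    fields = line.split()  # assumption: whitespace-separated columns
    if len(fields) < 3:
        return None  # blank or incomplete line
    return fields[0], fields[1], float(fields[2])


def read_matrix(lines):
    """Collect the first three columns of every usable line."""
    records = []
    for line in lines:
        record = parse_matrix_line(line)
        if record is not None:
            records.append(record)
    return records


# Made-up example lines; gene IDs and values are for illustration only.
sample = [
    "geneA\tgeneB\t0.85\t1e-50\t320",
    "geneA\tgeneC\t0.42",
    "",
]
print(read_matrix(sample))
```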
The "Coords file" contains this information:
column 1: chromosome number;
column 2: gene ID;
column 3: coordinates (position on the chromosome).
Positions on the chromosome can be expressed in nucleotides, in kilobases
(nt/1000), or in megabases (nt/1000000, with the decimal portion
representing finer-scale positions). Expressing positions in megabases is
recommended for compatibility with the previous (original) version of
GenomePixelizer.
Gene coordinates may be derived by parsing the corresponding GenBank files
or by editing *.ptt (protein translation table) files in a spreadsheet
editor such as Excel.
For full compatibility with the original version of GenomePixelizer, the
"Coords file" should also contain:
column 4: Watson/Crick gene orientation on chromosome;
column 5: gene color coding.
For details, see:
http://www.atgc.org/GenomePixelizer/GenomePixelizer_Welcome.html
The fourth and fifth columns are not required if you are using the
GenomePixelizer 2D plotter version only.
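A minimal sketch of reading one Coords line and of converting positions to
megabases is given below. It assumes whitespace-separated columns; the
function names, the unit labels and the sample values in columns 4 and 5
are hypothetical placeholders.

```python
UNIT_DIVISORS = {"nt": 1_000_000, "kb": 1_000, "Mb": 1}


def to_megabases(position, unit):
    """Convert a position given in nucleotides, kilobases or megabases."""
    return position / UNIT_DIVISORS[unit]


def parse_coords_line(line):
    """Return (chromosome, gene_id, position); columns 4 and 5 are ignored."""
    fields = line.split()  # assumption: whitespace-separated columns
    chromosome, gene_id, position = fields[0], fields[1], float(fields[2])
    return chromosome, gene_id, position  # the chromosome label stays a string


print(to_megabases(19_100_000, "nt"))
# Columns 4 and 5 below are placeholder values for illustration.
print(parse_coords_line("01\tgeneA\t19.1\tW\t3"))
```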
The annotation file is a tab-delimited file with gene IDs in the first
column and their descriptions in the second column. Annotation files may be
derived from the FASTA headers of the corresponding sequence files.
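One way to derive such a file from FASTA headers is sketched below in
Python. It assumes that the gene ID is the first whitespace-delimited word
of each header, which may not hold for every sequence file; the function
name and the sample records are hypothetical.

```python
def annotation_lines_from_fasta(fasta_lines):
    """Yield 'gene_id<TAB>description' for every FASTA header line."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        # Assumption: the ID is the first word, the rest is the description.
        gene_id, _, description = header.partition(" ")
        yield f"{gene_id}\t{description.strip()}"


# Made-up FASTA records for illustration.
fasta = [
    ">geneA putative protein kinase",
    "MSTNPKPQRK",
    ">geneB",
    "MAAALLK",
]
for row in annotation_lines_from_fasta(fasta):
    print(row)
```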
NOTE ABOUT CHROMOSOME NUMBERING:
GenoPix_2D_Plotter reads chromosome IDs as strings, not as integers. If
your data set has more than 9 (nine) chromosomes, label them as
01 02 03 ... 08 09 10 11 ... and so on. The program may not work properly
if chromosomes are labeled in a different way.
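The short Python sketch below shows how string comparison orders unpadded
labels, which is presumably why zero-padded labels are needed, and one way
to pad them. The helper name pad_chromosome_labels is hypothetical.

```python
def pad_chromosome_labels(labels):
    """Left-pad numeric labels with zeros to a common width."""
    width = max(len(label) for label in labels)
    return [label.zfill(width) for label in labels]


labels = [str(n) for n in range(1, 13)]
padded = pad_chromosome_labels(labels)

print(sorted(labels))  # as strings, "10" sorts before "2"
print(sorted(padded))  # padded labels sort in numeric order
```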
PROGRAM INTERFACE AND OPERATION
The current version of GenomePixelizer 2D plotter comes with example data
for the Arabidopsis thaliana genome. In the main window at program startup,
there are three entries for input files:
Gene Coordinates File: Matrix/ath_ncbi.coords
Matrix File: Matrix/ath_ncbi.matrix
Annotation File: Matrix/ath_ncbi.annotation
All three corresponding input files are located in the "Matrix"
directory. On some platforms, the full path may need to be entered in these
windows (e.g.
"/Users/yourhome/yourGenoPixDirectory/Matrix/ath_ncbi.coords").
Entries for the canvas parameters are displayed below the input files:
Chr X dir(ection) - selected chromosome will be plotted on X axis
Chr Y dir(ection) - selected chromosome will be plotted on Y axis
Start X: 0, End: END
Start Y: 0, End: END
These parameters mean that all genes of the selected chromosomes will be
plotted on the canvas. It is possible to specify a region of interest in
these entries. In this case only the selected region will be plotted on the
canvas.
Identity cutoff: 0.4 (the default) means that all hits with an identity
score of 40% or better will be plotted on the 2D canvas for the selected
chromosomes.
Canvas size X: 1000 means that chromosome X will be 1000 pixels wide on
the canvas. The canvas size in the Y direction is calculated
automatically, in proportion to the size of chromosome X.
Size of gene (pixels): 3 means that each gene (dot or element) is drawn on
the canvas as a 3x3 pixel square.
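The Python sketch below shows one plausible reading of this scaling: both
axes share the same pixels-per-unit scale, so the Y size follows from the
X size. The exact formula used by the program is not given here, so treat
this as an assumption; the function names and the chromosome lengths are
hypothetical.

```python
def canvas_height(canvas_x, span_x, span_y):
    """Y canvas size in pixels, using the same scale as the X axis."""
    return round(canvas_x * span_y / span_x)


def to_pixel(position, start, span, canvas_pixels):
    """Map a chromosome position to a pixel offset along one axis."""
    return (position - start) / span * canvas_pixels


# Made-up lengths in megabases: X region is 30 Mb, Y region is 15 Mb.
print(canvas_height(1000, 30.0, 15.0))
print(to_pixel(15.0, 0.0, 30.0, 1000))
```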
To run a project you need to click on "Load Data" first. The program will
load data from the Coords file and the Matrix file into the computer
memory. Clicking on "Load Annotation" makes the program read the Annotation
file.
After loading the Coords and Matrix data into the memory, you can click
"Plot Canvas". The program should create a canvas and plot the genes and
corresponding BLAST hits on it.
Note about program performance: It may take one or more minutes to plot one
chromosome against another. Performance on Windows machines is slower than
on Linux. Whole genome plotting (for example, all five Arabidopsis
chromosomes against themselves -- more than 27,000 genes and about 100,000
corresponding entries from Matrix file) may take a couple of hours.
Genes are plotted on the X and Y axes. Each axis consists of two adjacent,
parallel lines, which can carry extra information. The outside line (upper
or left) shows the positions of genes that were not used in the dot plot
comparison because they have no corresponding entries in the Matrix file.
The inside line (lower or right) shows genes that have matches above the
score threshold in the genomic region being compared.
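The split between the two lines of an axis can be sketched as follows in
Python. This simplified version ignores the region limits and only checks
whether a gene appears in any hit at or above the cutoff; the function name
split_plotted and the sample data are hypothetical.

```python
def split_plotted(genes_on_axis, hits, cutoff=0.4):
    """Return (plotted, unplotted) gene lists for one axis.

    hits: iterable of (gene_a, gene_b, identity) tuples from the Matrix file.
    """
    matched = set()
    for gene_a, gene_b, identity in hits:
        if identity >= cutoff:
            matched.update((gene_a, gene_b))
    plotted = [g for g in genes_on_axis if g in matched]
    unplotted = [g for g in genes_on_axis if g not in matched]
    return plotted, unplotted


# Made-up data: g3 only has a hit below the cutoff, g4 has no hit at all.
hits = [("g1", "g2", 0.8), ("g3", "g1", 0.3)]
print(split_plotted(["g1", "g2", "g3", "g4"], hits))
```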
COLOR SCHEME, NODE PAINTER AND TAG SYSTEM ON CANVAS
Initially, all plotted elements on the canvas are colored according to
their identity scores from the Matrix file, with this color scheme:
if "White" color mode is chosen (default):
1.0 - 0.9 - black
0.9 - 0.8 - 90% gray
0.8 - 0.7 - 80% gray
0.7 - 0.6 - 70% gray
0.6 - 0.5 - 60% gray
0.5 - 0.4 - 50% gray
less than 0.4 - 40% gray
canvas background is white
if "Black" color mode is chosen:
1.0 - 0.9 - white
0.9 - 0.8 - 10% gray
0.8 - 0.7 - 20% gray
0.7 - 0.6 - 30% gray
0.6 - 0.5 - 40% gray
0.5 - 0.4 - 50% gray
less than 0.4 - 60% gray
canvas background is black
You can check the examples of "white mode" and "black mode".
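The two tables can be expressed as a small lookup, sketched here in Python.
Two assumptions are made: a score lying exactly on a boundary (for example
0.9) belongs to the higher band, and "N% gray" is read as N percent black,
so that 100 stands for black and 0 for white. The names are hypothetical.

```python
# (lower bound of the identity band, gray percent in "White" mode)
WHITE_MODE_BANDS = [(0.9, 100), (0.8, 90), (0.7, 80), (0.6, 70), (0.5, 60), (0.4, 50)]


def gray_percent(identity, mode="white"):
    """Return the gray level (percent black) for an identity score."""
    for lower_bound, percent in WHITE_MODE_BANDS:
        if identity >= lower_bound:
            # "Black" mode mirrors the scale: black becomes white, and so on.
            return percent if mode == "white" else 100 - percent
    return 40 if mode == "white" else 60  # scores below 0.4


for score in (0.95, 0.85, 0.45, 0.3):
    print(score, gray_percent(score), gray_percent(score, "black"))
```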
Each element plotted on the 2D canvas has three IDs or tags. These tags
permit searching the canvas interactively for particular elements (genes).
There is a unique tag for every element. For example, if gene "A" has a hit
to gene "B", then the unique tag "A_B" or "B_A" is assigned to the
corresponding element. This allows finding the unique elements on the
canvas that correspond to any given gene pair.
If gene "A" has hits to genes "B", "C" and "D", then the redundant tag "A"
is assigned to every element in that set. For example, all dots "A_B",
"A_C" and "A_D" will have the tag "A". This allows searching the canvas for
all elements with hits to gene "A".
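The tag scheme can be mimicked with a dictionary that maps each tag to the
elements carrying it, as in the Python sketch below. It assumes that the
three tags of an element are the unique pair tag plus the two gene IDs; the
function name build_tag_index is hypothetical.

```python
from collections import defaultdict


def build_tag_index(pairs):
    """Map every tag to the set of elements (unique pair tags) that carry it."""
    index = defaultdict(set)
    for gene_a, gene_b in pairs:
        unique_tag = f"{gene_a}_{gene_b}"
        # Each element gets its unique tag plus both redundant gene tags.
        for tag in (unique_tag, gene_a, gene_b):
            index[tag].add(unique_tag)
    return index


index = build_tag_index([("A", "B"), ("A", "C"), ("A", "D")])
print(sorted(index["A"]))    # all elements with hits to gene "A"
print(sorted(index["A_B"]))  # the single element for the pair A, B
```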
The "Node painter" function allows a user to perform the search described
above. Clicking on the Node painter opens a new window. Different groups of
genes of interest are listed in separate files in the directory "Group".
Each file contains only gene IDs, in a one-column format.
It is possible to reset the color scheme on the canvas back to gray scale
by pressing the "Color Reset" button.
Several examples are contained in this distribution: group1 contains IDs of
putative resistance genes, group2 - cytochrome P450, group3 - putative
LRR-protein kinases, group4 - retrotransposon-like elements, group5 - all
kinases detected by an HMM, group6 - Leucine Rich Repeats. Clicking on the
"paint" button for a group paints all elements on the canvas that carry a
redundant ID from that group in the selected color. For example, if the
user clicks on the paint button of group 1, then all elements (nodes) on
the 2D plot that have BLAST hits to any resistance gene from the group1
file will be painted in red.
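In Python terms, the selection made by the paint button could look like the
sketch below. The function name elements_to_paint and the gene IDs are
hypothetical, and the actual painting on the canvas is left out.

```python
def elements_to_paint(group_ids, plotted_pairs):
    """Return the plotted pairs in which at least one gene is in the group."""
    group = set(group_ids)
    return [(a, b) for a, b in plotted_pairs if a in group or b in group]


# Made-up group file contents and plotted pairs.
group1 = ["R1", "R2"]
pairs = [("R1", "g5"), ("g7", "R2"), ("g5", "g7")]
print(elements_to_paint(group1, pairs))
```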
The "diagonal" group is a special group of the Node painter. It permits
displaying elements that have been identified as members of "diagonal"
features (syntenic regions). The group file that contains diagonal groups
has two columns with gene IDs. Each node is painted according to the pair
of IDs in this file. For example, if a line of the diagonal group file
contains the IDs "A" and "B", the only element(s) painted on the canvas are
those that have the unique tag "A_B" or "B_A". In addition, all genes on
the X and Y axes with tags "A" and "B" are painted. The diagonals in this
distribution were identified using the DiagHunter perl script by Steven
Cannon. DiagHunter should be included in the Scripts directory, and is also
available via http://www.tc.umn.edu/~cann0010.
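The pair-based matching of the diagonal group differs from the other groups
in that both IDs must match, in either order. A minimal Python sketch, with
the hypothetical function name diagonal_elements:

```python
def diagonal_elements(diagonal_pairs, plotted_pairs):
    """Return plotted pairs listed in the diagonal file, in either order."""
    # frozenset makes ("A", "B") and ("B", "A") compare as equal.
    wanted = {frozenset(pair) for pair in diagonal_pairs}
    return [pair for pair in plotted_pairs if frozenset(pair) in wanted]


# Made-up data: the diagonal file lists B, A; the canvas has A_B and A_C.
print(diagonal_elements([("B", "A")], [("A", "B"), ("A", "C")]))
```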
MOUSE CLICKS, EVENTS BINDINGS, AND THE CANVAS EDITOR
A mouse click on any plotted element on the 2D canvas will generate blue
lines that cross on the selected element. Genes with hits to the selected
element will be highlighted with intersecting yellow lines. A mouse click
on a gene plotted on the X or Y axis draws a green line if the gene has
hit(s) to other gene(s). Again, yellow lines crossing the green line
indicate the positions of hits to the selected gene.
A mouse click will also give annotation information for the selected element
in the "Annotation window," as well as gene IDs and their coordinates in the
"Gene IDs" window.
The "Canvas Editor" makes it possible to add simple graphical labels and
text to the canvas. To remove all labels with an active tag, point the
mouse at the element, hold the "Shift" key and click.
When the "Canvas Editor" window is open, point the mouse cursor at any gene
or plotted pair, hold the "Control" key and press the mouse button. This
action prints the gene IDs corresponding to the selected element.
SEARCH BY GENE ID OR BY KEYWORD
The canvas is searchable for particular elements (genes) by gene ID or by
keyword. Search by keyword is available only if an "Annotation File" has
been loaded into computer memory. To search by gene ID, type the ID of the
gene you want to find in the gene ID entry window. The search will succeed
if the corresponding gene has been plotted on the 2D canvas (that is, it
has BLAST hit(s) within the specified identity cutoff). In this case a
green line on the 2D canvas will point to the selected gene on the X or Y
axis, and crossed yellow lines will highlight all BLAST hits to that gene.
The "Search by keyword" entry window is located at the bottom of the
"Annotation Window". Search by keyword searches the annotation description
lines and paints (highlights) all matching genes and corresponding BLAST
hits with the color scheme defined in the "Canvas Editor" for graphical
elements.
You can reset the color back to gray scale by clicking on the "Color Reset"
button in the "Node Painter" window.
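A keyword search over annotation descriptions can be sketched in Python as
below. It assumes a case-insensitive substring match, which may differ from
the matching rule the program applies; the function name and the sample
annotations are hypothetical.

```python
def search_by_keyword(annotations, keyword):
    """Return sorted gene IDs whose description contains the keyword."""
    needle = keyword.lower()  # assumption: the match ignores case
    return sorted(
        gene_id
        for gene_id, description in annotations.items()
        if needle in description.lower()
    )


# Made-up annotations for illustration.
annotations = {
    "g1": "putative protein kinase",
    "g2": "cytochrome P450",
    "g3": "receptor-like Kinase",
}
print(search_by_keyword(annotations, "kinase"))
```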
ZOOM IN FUNCTIONALITY
It is possible to select and display a region of interest in greater
detail. For example, for the default example data set (the Arabidopsis
thaliana genome) select:
Chr X dir: 1 X Start: 19.1 X End: 23.4
Chr Y dir: 1 Y Start: 2.1 Y End: 6.3
Canvas Size X: 600
Then click "Plot Canvas". A canvas with only the selected region for the
given pair of chromosomes should appear, displaying one of the duplicated
regions of the Arabidopsis genome.
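Selecting the genes that fall inside such a region amounts to a simple
range filter, sketched here in Python. The bounds are assumed to be
inclusive, end=None stands in for the "END" value of the interface, and the
function name and coordinates are hypothetical.

```python
def genes_in_region(coords, chromosome, start=0.0, end=None):
    """Return gene IDs on `chromosome` whose position lies in [start, end].

    coords: iterable of (chromosome, gene_id, position) tuples.
    end=None means no upper limit (the "END" setting).
    """
    selected = []
    for chrom, gene_id, position in coords:
        if chrom != chromosome or position < start:
            continue
        if end is not None and position > end:
            continue
        selected.append(gene_id)
    return selected


# Made-up coordinates in megabases.
coords = [("1", "g1", 18.0), ("1", "g2", 19.5), ("1", "g3", 23.0), ("2", "g4", 20.0)]
print(genes_in_region(coords, "1", 19.1, 23.4))
```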
SAVING THE RESULTS OF YOUR WORK
Clicking on the button "Save as PostScript" will generate a PostScript file
with the selected filename.
DEFAULT DATA SET
The default data set represents the Arabidopsis genome, downloaded from the
NCBI site in January 2003. All protein sequences were compared against one
another with BLAST, using these options:
blastall -p blastp -F F -d ./ath_ncbi.fasta -i ./ath_ncbi.fasta \
    -o ath_vs_ath_ncbi_200_hits.blastp.out -e 1e-20 -v 200 -b 200
Results of the BLAST search were parsed by tcl_blast_parser_123
(see http://cgpdb.ucdavis.edu/BlastParser/Blast_Parser.html ), with default
options. All hits with an expectation value of 1e-20 or lower, an identity
of 40% or higher, and an alignment overlap greater than 100 amino acids
were compiled into the Matrix file (this matrix file was generated
automatically by tcl_blast_parser_123). Gene coordinates were extracted
from the corresponding GenBank files.
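The three criteria can be written as a single predicate. The Python sketch
below only restates the thresholds given above; it is not the parser's own
code, and the function name passes_filters is hypothetical.

```python
def passes_filters(evalue, identity, overlap,
                   max_evalue=1e-20, min_identity=0.4, min_overlap=100):
    """True if a BLAST hit meets all three criteria of the default data set."""
    return (
        evalue <= max_evalue          # expectation value 1e-20 or lower
        and identity >= min_identity  # identity 40% or higher
        and overlap > min_overlap     # overlap greater than 100 amino acids
    )


print(passes_filters(1e-30, 0.55, 250))
print(passes_filters(1e-10, 0.55, 250))
```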
Searches for domains (group files 1 through 6) were done using the
hmmsearch program (http://hmmer.wustl.edu).
To display all five Arabidopsis chromosomes at once, select the
"x-ath_ncbi.coords" file in the "Gene Coordinates File" entry window.
Change the X canvas size to 4000 pixels. Because more than 200,000 elements
will be plotted on the canvas, it will take a while for the plot to finish.
The "x-ath_ncbi.coords" file is almost identical to "ath_ncbi.coords",
except that chromosome number 1 was assigned to all genes and the gene
coordinates were recalculated to form one continuous, long
pseudochromosome.
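A comparable pseudochromosome file could be produced for another data set
along the lines of the Python sketch below. How "x-ath_ncbi.coords" itself
was built is not described here, so the approach (appending chromosomes in
sorted label order, shifted by the lengths of the preceding ones), the
function name and the chromosome lengths are all assumptions.

```python
def to_pseudochromosome(coords, chromosome_lengths):
    """Shift gene positions so that all chromosomes form one long chromosome.

    coords: iterable of (chromosome, gene_id, position) tuples.
    chromosome_lengths: dict mapping chromosome label to its length.
    """
    offsets, total = {}, 0.0
    for chrom in sorted(chromosome_lengths):
        offsets[chrom] = total  # start of this chromosome on the long axis
        total += chromosome_lengths[chrom]
    return [("1", gene_id, position + offsets[chrom])
            for chrom, gene_id, position in coords]


# Made-up lengths and positions in megabases.
lengths = {"1": 30.0, "2": 20.0}
coords = [("1", "a", 5.0), ("2", "b", 1.5)]
print(to_pseudochromosome(coords, lengths))
```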
Duplicated regions in Arabidopsis (in the diagonal group file) were
identified using Steven Cannon's DiagHunter perl script. DiagHunter can be
found in the "Scripts" directory, or at www.tc.umn.edu/~cann0010
FEEDBACK
Feedback and comments are very welcome. Please send email to:
Alexander Kozik akozik@atgc.org
GenomePixelizer 2D plotter is distributed under the GNU public license.
Copyright © 2003 University of California at Davis