tcl_blast_parser_123_V017.tcl
PROGRAM DESCRIPTION and USAGE
For example, we have a set of 4448 Cacao ESTs (Theobroma cacao) downloaded from NCBI in file: Cacao_EST.fasta.
We would like to BLAST this set against predicted proteins of Arabidopsis thaliana
Ath_NCBI.faa.gz and set up tab-delimited
tables which can be used as an input for mySQL database.
It is
assumed that NCBI
BLAST is installed on your computer. First, format the database file.
Type in the following command:
formatdb -i Ath_NCBI.faa -p T
There are three new
files generated. In this case, they are Ath_NCBI.faa.phr,
Ath_NCBI.faa.pin, and Ath_NCBI.faa.psq .
To run BLAST in command
line we need to execute a command:
blastall -p blastx -F T -d Ath_NCBI.faa -i Cacao_EST.fasta -o
Cacao_EST_vs_Ath_NCBI.blastX -e 1e-05 -I T -v 12 -b 12
where by
different options:
"blastall -p blastx"
to run blastx program (DNA against proteins)
"-F T" we want to use low complexity
filter
"-d Ath_NCBI.faa" our database "-i Cacao_EST.fasta" our input file
"-o Cacao_EST_vs_Ath_NCBI.blastX" our output file
"-e 1e-05" expectation cutoff value
"-I T" if GenBank IDs are present in database file they will be
displayed in output file
"-v 12 -b 12" we want to see in the output file up to 12 short description lines
for hits if found and up to 12 sequence alignments
After some time
(~ two hours on a 1 GHz computer), when the search is done BLAST
should generate Cacao_EST_vs_Ath_NCBI.blastX.gz
output file. Size of file is about 20 Mb. Then we can process BLAST output using
tcl_blast_parser_123_V017.tcl and generate tab delimited parsed files
suitable for further data analysis.
tcl_blast_parser_123_V017.tcl takes as input the
results of stand alone BLAST search. You need to run program in command
shell or DOS prompt with six arguments:
tcl_blast_parser_123_V017.tcl
[input_file] [output_file] [expect_cutoff] [identity_cutoff]
[overlap_cutoff] [matrix_option]
For example,
in UNIX and in DOS respectively:
tcl_blast_parser_123_V017.tcl
Cacao_EST_vs_Ath_NCBI.blastX Cacao_EST_vs_Ath_NCBI.blastX 20 40 100 GRAPH
tclsh
tcl_blast_parser_123_V017.tcl Cacao_EST_vs_Ath_NCBI.blastX Cacao_EST_vs_Ath_NCBI.blastX
20 40 100 GRAPH
in this case file with name
Cacao_EST_vs_Ath_NCBI.blastX will be analyzed by parser and ten new
files with names:
Cacao_EST_vs_Ath_NCBI.blastX.adj_list
Cacao_EST_vs_Ath_NCBI.blastX.all_hits
Cacao_EST_vs_Ath_NCBI.blastX.best_hit
Cacao_EST_vs_Ath_NCBI.blastX.blast_stat
Cacao_EST_vs_Ath_NCBI.blastX.group_degree_info
Cacao_EST_vs_Ath_NCBI.blastX.id_list
Cacao_EST_vs_Ath_NCBI.blastX.info1
Cacao_EST_vs_Ath_NCBI.blastX.info2
Cacao_EST_vs_Ath_NCBI.blastX.matrix
Cacao_EST_vs_Ath_NCBI.blastX.two_hits
will
be generated upon analysis. You can specify output file name the same as
input file name, parser will add file extensions to each type of output,
so input file will remain without any changes.
Below is the
detailed description of every file produced by parser:
Cacao_EST_vs_Ath_NCBI.blastX.all_hits
contains
info about ALL HITS in blast report
- 1 column: "Query" sequence ID
- 2 column: "Subject" sequence ID
- 3 column: normalized expectation (-log(Exp))
- 4 column: percent of identity
- 5 column: number of perfect matches
- 6 column: length of the alignment
- 7 column: hit number for primary alignment (1 is the best
first hit)
- 8 column: hit number for alternative alignment
- 9 column: PRM stands for primary alignment, ALT for
alternative alignment
- 10 column: frame of translation (+1, +2, +3, -1, -2 or -3)
- 11 column: first position of the "Query" sequence in the alignment
- 12 column: last position of the "Query" sequence in the alignment
- 13 column: first position of the "Subject" sequence in the
alignment
- 14 column: last position of the "Subject" sequence in the
alignment
Cacao_EST_vs_Ath_NCBI.blastX.best_hit
contains
info about FIRST BEST hits only
- 1 column: "Query" sequence ID
- 2 column: "Subject" sequence ID
- 3 column: normalized expectation (-log(Exp))
- 4 column: percent of identity
- 5 column: number of perfect matches
- 6 column: length of the alignment
- 7 column: hit number for primary alignment (1 is the best
first hit)
Cacao_EST_vs_Ath_NCBI.blastX.blast_stat contains
info HOW MANY HITS every "query" sequence has. If it is
just a single hit, special mark "SINGLE_HIT" is placed at an additional
column
Cacao_EST_vs_Ath_NCBI.blastX.id_list list
of all gene IDs found in Cacao_EST_vs_Ath_NCBI.blastX file including all "query"
IDs as well as all "subject" IDs
Cacao_EST_vs_Ath_NCBI.blastX.matrix pairwise
(binary) matrix for all primary hits if they are better than 20 (1e-20)
expectation cutoff value, 40% identity and 100 aa (or nt) alignment
overlap length. The three values of expectation, identity and overlap
cutoff are the third, fourth and fifth arguments correspondingly when the
script is executed from command line.
- 1st and 2nd column: pair of genes
- 3rd column: IDENTITY value normalized between 0 and 1
- 4th column: normalized expectation (-log(Exp))
- 5th column: length of the alignment
Cacao_EST_vs_Ath_NCBI.blastX.two_hits if
"query" has two or more hits, additional data will be written into the
"TWO HITS" file. This file compares expectation, identity and
alignment length scores for the best first hit to the next after him. If
differences of these scores are greater than cutoff values defined in the
body of parser, then special mark "GOOD_DIFF" will be added to the last
column. Otherwise the differences are considered as "BAD_DIFF". Values
for cutoff criteria are:
- diff expectation values should be greater than 20 (1e-20)
- identity scores for second hit should be two times less than for first hit
- alignment length generated by BLAST for second hit should be shorter than for first hit.
This file is really
useful when you run one set of genes against another and want to find a set of
gene (EST) which have single strong hits to a target genome (Arabidopsis in our case).
For example, Lettuce/Sunflower
COS candidates (Conserved Orthologs
Set) BLAST search against Arabidopsis genome and its
validation by tcl_blast_parser_123_V017.tcl
Cacao_EST_vs_Ath_NCBI.blastX.out.adj_list is
an Adjacency List for protein sequences based on Matrix
File. If query (gene, first column) is similar to other genes
(subject in BLAST report) within defined cutoff values (expectation,
identity and alignment length) then these gene IDs are written into
corresponding row. In other words, this is a list of nodes (genes) with
direct connection to each others.
Cacao_EST_vs_Ath_NCBI.blastX.group_degree_info is
a Group Degree Info file. Genes (protein sequences) may have
similarity to each other via transitive homology. Group Degree
Info file represents information about clusters of genes belonging
to distinct groups. Each group in Group Degree Info file is a
connected graph. It means each node (gene) can reach all other nodes
(genes) via edges (identity scores).
- 1 column -- Node (gene) ID
- 2 column -- Number of other genes connected to the current gene
- 3 column -- Graph size (number of genes in the graph)
- 4 column -- Group number
- 5 column -- Either "****" or nothing. The "****" indicates the
beginning of a new group. This is only for visual purpose.
tcl_blast_parser_123_V017.tcl can easily cluster
group of genes up to 1000 in a set. To process a set of genes with
number greater than 1000 in a group we recommend to use Graph9
program. It is written in C. It works much faster and provides
additional information about articulation point (or bridges) which helps
identify multidomain sequences. Also, if you are using tcl_blast_parser_123_V017.tcl
with MATRIX option, Adjacency List and Group Degree Info are not
generated. Parser in this case will stop the BLAST output analysis at the creation
of Matrix file and total time of analysis will be much shorter.
Finally, description of Info1 and Info2 files:
Cacao_EST_vs_Ath_NCBI.blastX.info1
contains
info about FIRST BEST hits only and description line of the subject from BLAST report
- 1 column: "Query" sequence ID
- 2 column: "Subject" sequence ID
- 3 column: description line for "Subject" sequence
- 4 column: normalized expectation (-log(Exp)/100) [note, that expectation is normalized between 0 and 1]
- 5 column: percent of identity
- 6 column: number of perfect matches
- 7 column: length of the alignment
- 8 column: hit number for primary alignment (1 is the best
first hit)
Cacao_EST_vs_Ath_NCBI.blastX.info2
contains
info about ALL PRIMARY HITS in blast report. It is almost identical to All Hits file with exception that
it contains description line for subject, expectation is normalized between 0 and 1 and there is no information for
ALT - alternative alignments
- 1 column: "Query" sequence ID
- 2 column: "Subject" sequence ID
- 3 column: description line for "Subject" sequence
- 4 column: normalized expectation (-log(Exp)/100) [note, that expectation is normalized between 0 and 1]
- 5 column: percent of identity
- 6 column: number of perfect matches
- 7 column: length of the alignment
- 8 column: hit number for primary alignment (1 is the best
first hit)
- 9 column: frame of translation (+1, +2, +3, -1, -2 or -3)
- 10 column: first position of the "Query" sequence in the alignment
- 11 column: last position of the "Query" sequence in the alignment
- 12 column: first position of the "Subject" sequence in the
alignment
- 13 column: last position of the "Subject" sequence in the
alignment
- 14 column: Length of "Query"(nt or aa) / Length of "Subject"(nt or aa)
Tables like "Info2" can be used as input data table for mySQL database. For example:
Lettuce_vs_Arabidopsis table at CGPDB.
It has phpMyAdmin interface to manage data. phpMyAdmin provides
basic tools to query and sort data in database via web browser.
DOWNLOAD: tcl_blast_parser_123_V017.tcl
Tcl/Tk ToolKit
tcl_blast_parser_123_V012.tcl
- previous version (almost identical to tcl_blast_parser_123_V017 with exception that last column #14 of
Info2 file (Query Length/Subject Length) is not generated.
Work in progress! This version of BLAST parser
tcl_blast_parser_123_V025.tcl
extracts query and subject sequences from BLAST report and writes them into separate file "*.align".
For example:
Cacao_EST_vs_Ath_NCBI.blastX.align will be generated with Cacao BLAST example above. This is very useful when we want to extract sequence alignment from BLAST report.
Further Development:
tcl_blast_parser_123_V033.tcl
- to extract sequences in tab-delimited format from BLAST report
tcl_blast_parser_123_V038.tcl
- current working version
Download BLAST 2.2.10 binaries (win, linux, mac and sparc)
BLAST_2_2_10_BIN.tar.gz
|