Codon Usage Pipeline

Codon Usage Pipeline based on BLASTX search

Alexander Kozik and Richard Michelmore
University of California at Davis

This codon usage pipeline is based on analysis of results of ungapped BLASTX search. The pipeline is a set of Python/Tcl-Tk scripts which extracts open reading frames (ORFs) from BLAST alignments generated by BLASTX search and counts codon usage for query sequences as well as for subject sequences and generates report in the form of tab-delimited files. This approach is useful when we want to find and extract putative ORFs from a set of ESTs (query) by comparing (aligning) it with a set of sequences with known ORFs (subject in BLAST terminology). Dataflow of this pipeline is shown below:

Input files:

    1. "DNA" file which corresponds to open reading frames of the subject sequences
    2. "Protein" file which derived by translation of "DNA" file
    For example, Arabidopsis ORFs and their translated sequences were derived from NCBI GenBank files using Python GenBank parser:
    ath_NCBI_cds.fasta.nr.info.gz (Arabidopsis CDS or "DNA" file)
    ath_NCBI_genes.fasta.nr.info.gz (corresponding protein sequences or "protein" file)
    (Note, instead of Arabidopsis files user can use any other dataset as long as "protein" file is derived from "DNA" file)

    3. cos_soybean.fasta (example set of soybean ESTs or "Query" file for which we want to find ORFs)

Programs and scripts:

    1. NCBI Blast program (you need to download and install BLAST program from NCBI ftp site)
    2. tcl_blast_parser_123_V025.tcl (read more at: http://www.atgc.org/BlastParser/Blast_Parser_017.html)
    3. SeqsExtractorFromBlastX_024_V07d.py

Pipeline usage:

    1. run BLASTX search of cos_soybean.fasta (ESTs query set) against Arabidopsis protein database:

> blastall -p blastx
-F "m
S" -g F -d ./ath_NCBI_genes.fasta.nr.info 

    -i ./cos_soybean.fasta -o
x-cos_soybean_vs_ath_24hits_no_gaps.out 

    -e 1e-10 -v 24 -b 24

    (note options -F "m S" [soft filtering] and -g F [ungapped BLAST])
    after some time (1-2 hours on 1 GHz CPU) x-cos_soybean_vs_ath_24hits_no_gaps.out will be generated

    2. run tcl_blast_parser on BLASTX output:

>
tclsh tcl_blast_parser_123_V025.tcl
x-cos_soybean_vs_ath_24hits_no_gaps.out 

    x-cos_soybean_vs_ath_24hits_no_gaps.out 20 40 100
MATRIX

after parsing several files will be generated (see details at http://www.atgc.org/BlastParser/Blast_Parser_017.html).
We are interested in x-cos_soybean_vs_ath_24hits_no_gaps.out.info2 file only which contains following information:

1 column: "Query" sequence ID

2 column: "Subject" sequence ID

3 column: description line for "Subject" sequence

4 column: normalized expectation (-log(Exp)/100) [note, that expectation is normalized between 0 and 1]

5 column: percent of identity

6 column: number of perfect matches

7 column: length of the alignment

8 column: hit number for primary alignment (1 is the best first hit)

9 column: frame of translation (+1, +2, +3, -1, -2 or -3)

10 column: first position of the "Query" sequence in the alignment

11 column: last position of the "Query" sequence in the alignment

12 column: first position of the "Subject" sequence in the alignment

13 column: last position of the "Subject" sequence in the alignment

14 column: Length of "Query"(nt or aa) / Length of "Subject"(nt or aa)

15 column: gap info

    SeqsExtractorFromBlastX will use information from "*.info2" file to extract sub-sequences from query and subject

    3. run SeqsExtractorFromBlastX script as described below:

>
python SeqsExtractorFromBlastX_024_V07d.py cos_soybean.fasta
ath_NCBI_cds.fasta.nr.info 

    x-cos_soybean_vs_ath_24hits_no_gaps.out.info2
x-cos_soybean.codons x-cos_soybean.align

SeqsExtractorFromBlastX_024_V07d.py script takes five arguments as input:

1-st argument: cos_soybean.fastafile contains EST sequences (BLASTX query)
2-nd argument: ath_NCBI_cds.fasta.nr.info file contains DNA sequences of BLASTX database (subject)
3-d argument: x-cos_soybean_vs_ath_24hits_no_gaps.out.info2 file
4-th argument: x-cos_soybean.codons is output file(s) of SeqsExtractorFromBlastX script
5-th argument: x-cos_soybean.align is a directory where all extracted alignments will be written

    SeqsExtractorFromBlastX script will generate a new directory with DNA sequence alignments in FASTA format:
        x-cos_soybean.align
    Each alignment is extracted according to first best hit of BLASTX report and written into separate file. Also, each alignment corresponds to open reading frames only and does not contain stop codons within query or subject sequences. However alignments may contain unambiguous letters (N or X, for example).
    You can download compressed example output here: x-cos_soybean.align.tar.gz

    Other output files of the SeqsExtractorFromBlastX script:

x-cos_soybean.codons.overlap_info - contains positional information about extracted subsequences (e.g. from which part of original sequence any given subsequence was extracted)
x-cos_soybean.codons.query_seq - file with extracted "query" subsequences in FASTA format (fragments of soybean ESTs corresponding to ORFs according to BLASTX alignments of this example)
x-cos_soybean.codons.subj_seq - file with extracted "subject" subsequences in FASTA format (fragments of Arabidopsis ORFs corresponding to BLASTX alignments of this example)
x-cos_soybean.codons.triplets - info file with codon usage for "query" and "subject" sequences (counts all triplets)
x-cos_soybean.codons.x_kska - file with information about synonymous/non-synonymous substitutions for every non-gapped alignment
x-cos_soybean.codons.x_triplets - info file with codon usage for "query" and "subject" sequences (conservative case, e.g. this file contains information for triplets only for cases when both triplets (query and subject) are informative (do not contain unambiguous letters).

Detailed output explanation:

Column description of "*.triplets" and "*.x_triplets" files:

1-st column: amino acid
2-nd column: triplet
3-d column: "query" triplet usage per 100,000 cases [query_value]
4-th column: "subject" triplet usage per 100,000 cases [subject_value]
5-th column: [diff_value] difference between 3-d and 4-th columns ( [diff_value] = [query] minus [subject] )
6-th column: difference in percentages ( [diff_value%] = [diff_value] / [subject_value] * 100 )
7-th column: total number of the given triplet [query]
8-th column: total number of all analyzed triplets [query]
9-th column: total number of the given triplet [subject]
10-th column: total number of all analyzed triplets [subject]
11-th column: percentage of query codon usage
12-th column: percentage of subject codon usage
13-th column: difference between column 11 and 12 ( [query] minus [subject] )
14-th column "***" visual mark
15-th column: percentage of query amino acid usage
16-th column: percentage of subject amino acid usage
17-th column: difference between column 15 and 16 ( [query] minus [subject] )

(Note, that the total number of all analyzed triplets of query and the total number of all analyzed triplets of subject may be different in the file "*.triplets" and must be the same in "*.x_triplets")

Column description of "*x_kska" file:

1-st column: query ID
2-nd column: subject ID
3-d column: number of synonymous substitutions
4-th column: number of non-synonymous substitutions
5-th column: difference between columns 3 and 4
6-th column: "***" visual mark
7-th column: number of perfect codon matches
8-th column: number of amino acid matches
9-th column: total number of triplets in the alignment
10-th column: "***" visual mark
11-th column: identity (%) at protein level
12-th column: ratio of [perfect codon matches] to [amino acid matches] ((column7 / column8) * 100)

(Note, that the sum of values in columns 3 and 7 must be equal to value of column 8 ([synonymous substitutions] plus [perfect codon matches] = [amino acid matches])

Additional scripts to determine GC% content in a set of sequences:

Finally, we can calculate GC% content for our extracted EST query ORFs subset of sequences using seqs_processor_017_u.py script:

 
> python seqs_processor_017_u.py x-cos_soybean.codons.query_seq
x-cos_soybean.codons.query_seq.proc DNA

Script takes three arguments ([input], [output], ["DNA" option]) and generates three output files:

    We are interested in x-cos_soybean.codons.query_seq.proc.stat file which contains information about GC% content for every sequence as well as summary (last line)

    (Note, seqs_processor_017_u.py script will terminate if gene ID duplication takes place in the input file. seqs_processor_017_d.py should work with duplicated IDs)

    py_stat_graph_007.py script can process x-cos_soybean.codons.query_seq.proc.stat file and generate MS excel ready output to show distribution of DNA sequences proportional to their GC% content:

>
python py_stat_graph_007.py x-cos_soybean.codons.query_seq.proc.stat 

    x-cos_soybean.codons.query_seq.proc.stat.graph 1 100
8

py_stat_graph_007.py takes five arguments ( [input] [output] [min_value] [max_value] [column_number] ), analyzes defined column of input file and generates statistical information: x-cos_soybean.codons.query_seq.proc.stat.graph.

click here to find out how to obtain multiple alignments of orthologs by combining two subsets generated by pipeline described above.

DNA translation in six frames Python script

email: akozik@atgc.org
last modified: April 21 2006