Tools and scripts

Here are some simple tools and scripts I wrote; most of them are standalone, single-file scripts for next-generation sequencing data.

GTF2BED: converting Cufflinks GTF predictions to BED files

GETINSERTSIZE: extracting paired-end read insert size information from Cufflinks output

Estimating paired-end read insert length from SAM/BAM files



GTF2BED: converting Cufflinks GTF predictions to BED files

(This tool was originally posted on Blogspot: link)

Cufflinks writes its predictions as .GTF files. However, these files are sometimes too large to work with: for example, they may be too large to upload to the UCSC Genome Browser (even if you compress them), and the IGV browser needs more memory to process a .GTF file than a .BED file. So I wrote a small script (in Python 3) to convert GTF-formatted files to BED files, which reduces the file size dramatically. For example, this script converts a 120M GTF file to only a 9M BED file, a reduction of more than 90%!

Of course, there are general tools to convert .GTF to .BED; for example, here is one Perl script that does this. However, my script is written specifically for Cufflinks .GTF files: it recognizes the "gene_id", "transcript_id" and "FPKM" keywords in the attribute list. The "transcript_id" becomes the name field of the .BED record, and the "FPKM" value is rounded to become its score field.

Here is an example of a .GTF file predicted by Cufflinks:


chr1    Cufflinks       transcript      934797  935655  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021"; full_read_support "yes";
chr1    Cufflinks       exon    934797  934812  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "1"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";
chr1    Cufflinks       exon    934906  935655  324     -       .       gene_id "CUFF.29"; transcript_id "CUFF.29.3"; exon_number "2"; FPKM "23.5107601622"; frac "0.216036"; conf_lo "20.031437"; conf_hi "26.990083"; cov "84.378021";

The converted .BED file contains only one line, as follows:

chr1    934796  935655  CUFF.29.3       24      -       934796  935655  0,0,255 2       16,750  0,109

The .BED record does not contain information like "frac", "conf_lo", "conf_hi" or "cov", but it uses only one line instead of three. Notice that the transcript_id is kept, and the FPKM value (23.5) is rounded to 24 in the score field of the .BED record.
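
To illustrate the conversion logic, here is a minimal sketch (not the actual script) of how such a Cufflinks-specific converter can work: it reads a GTF on standard input, groups the exon records by "transcript_id", shifts coordinates from 1-based GTF to 0-based BED, and rounds the "FPKM" value into the score field. The strand-based coloring is an assumption, chosen to match the 0,0,255 in the example above.

import re
import sys

def attr(attrs, key):
    # Pull a quoted value such as FPKM "23.51" out of the GTF attribute column.
    m = re.search(key + r' "([^"]+)"', attrs)
    return m.group(1) if m else None

# transcript_id -> [chrom, strand, FPKM, list of (start, end) exons]
transcripts = {}
for line in sys.stdin:
    f = line.rstrip('\n').split('\t')
    if len(f) < 9 or f[2] != 'exon':
        continue
    tid = attr(f[8], 'transcript_id')
    rec = transcripts.setdefault(tid, [f[0], f[6], attr(f[8], 'FPKM'), []])
    rec[3].append((int(f[3]) - 1, int(f[4])))  # GTF is 1-based inclusive; BED is 0-based half-open

for tid, (chrom, strand, fpkm, exons) in transcripts.items():
    exons.sort()
    start, end = exons[0][0], exons[-1][1]
    sizes = ','.join(str(e - s) for s, e in exons)
    starts = ','.join(str(s - start) for s, _ in exons)
    score = str(round(float(fpkm))) if fpkm else '0'
    color = '255,0,0' if strand == '+' else '0,0,255'
    print('\t'.join([chrom, str(start), str(end), tid, score, strand,
                     str(start), str(end), color, str(len(exons)), sizes, starts]))

Running this sketch over the three GTF lines above reproduces the single BED line shown.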

Feel free to comment or ask questions.



GETINSERTSIZE: extracting paired-end read insert size information from Cufflinks output

(originally published on Blogspot: link)

Newer versions of Cufflinks automatically estimate the mean and standard deviation of the paired-end read insert size. Sometimes we need to extract this information to check our sequencing data, so I wrote a simple script that searches the Cufflinks output for it.

The script is written in Python, and the implementation is simple: it searches the log files for keywords such as "Estimated Mean".


Usage: getinsertsize [cufflinks log file]

Note: you may specify multiple log files using filename wildcards.

Sample output:

File    MapMass ReadLength      Mean    Std
cufflinksinvoke.sh.e2148442     34108402.45     85      150.35  19.18
cufflinksinvoke.sh.e2148443     31007287.59     85      155.19  17.45
cufflinksinvoke.sh.e2148444     37286038.02     85      123.87  25.60
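
As a sketch of how this keyword search can be implemented (this is not the distributed script, and the exact log phrases "Total Map Mass", "Read Type", "Estimated Mean" and "Estimated Std Dev" are assumptions about the Cufflinks log format):

import re
import sys

# One pattern per output column; each matches a line Cufflinks is assumed
# to print while learning the fragment length distribution.
PATTERNS = [
    ('MapMass', re.compile(r'Total Map Mass:\s*([\d.]+)')),
    ('ReadLength', re.compile(r'Read Type:\s*(\d+)bp')),
    ('Mean', re.compile(r'Estimated Mean:\s*([\d.]+)')),
    ('Std', re.compile(r'Estimated Std Dev:\s*([\d.]+)')),
]

print('File\tMapMass\tReadLength\tMean\tStd')
for path in sys.argv[1:]:  # wildcards are expanded by the shell
    found = {}
    with open(path) as fh:
        for line in fh:
            for key, pat in PATTERNS:
                m = pat.search(line)
                if m:
                    found[key] = m.group(1)
    print('\t'.join([path] + [found.get(key, 'NA') for key, _ in PATTERNS]))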





Estimating paired-end read insert length from SAM/BAM files

(originally published on Blogspot: link)

I wrote a single Python script to estimate the paired-end read insert length (or fragment length) from read mapping information (i.e., SAM/BAM files). The algorithm is simple: check the PNEXT field in the SAM format, throw out paired-end reads whose mates map too far away, and use the remaining pairs to estimate the mean and variance of the insert length.

The script is distributed on GitHub.

Usage:

getinsertsize.py [ SAM file ]


or


samtools view [ BAM file ] | getinsertsize.py

Sample output:

Read length: mean 90.6697303194, STD=15.9446036414
Possible read length and their counts:
{108: 43070882, 76: 50882326}
Read span: mean 165.217445903, STD=32.8914834802
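
For illustration, here is a minimal sketch of the estimation (not the GitHub script). It reads SAM records from standard input and uses the TLEN column, which aligners derive from POS and PNEXT, as the observed span; the 1,000,000 bp cutoff for "too far away" is an assumption:

import statistics
import sys

MAX_SPAN = 1000000  # assumed cutoff for discarding mates mapped too far away

read_lengths = {}   # read length -> count
spans = []
for line in sys.stdin:
    if line.startswith('@'):       # skip SAM header lines
        continue
    f = line.rstrip('\n').split('\t')
    if len(f) < 11:
        continue
    rl = len(f[9])                 # the SEQ column gives the read length
    read_lengths[rl] = read_lengths.get(rl, 0) + 1
    tlen = int(f[8])               # TLEN: signed observed template length
    if 0 < tlen < MAX_SPAN:        # count each pair once, at its leftmost mate
        spans.append(tlen)

total = sum(read_lengths.values())
mean_rl = sum(l * c for l, c in read_lengths.items()) / total
print('Read length: mean %g' % mean_rl)
print('Possible read lengths and their counts:')
print(read_lengths)
print('Read span: mean %g, STD=%g' % (statistics.mean(spans), statistics.pstdev(spans)))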


Note: if the SAM/BAM file is very large, estimating from only a subset of the reads (say, 1 million) is usually accurate enough. In that case, you can run the script as follows:

head -n 1000000 [ SAM file ] | getinsertsize.py

or

samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py



by Wei Li