IsoInfer V0.9

Source Code

Compilation

Install the following C/C++ libraries: glpk, gsl and QuadProg++. If you cannot install them in the standard system directories and install them under /your/installed/path/ instead, set the environment variables CXXFLAGS and LD_LIBRARY_PATH as follows:
```
export CXXFLAGS="-g -O3 -I/your/installed/path/include -L/your/installed/path/lib" 
export LD_LIBRARY_PATH="/your/installed/path/lib:"$LD_LIBRARY_PATH 
```
If the compiler complains that it cannot find a library even though you installed it in the default location, find the path where the library was installed and set the two environment variables above manually. This may happen with QuadProg++ installed on Ubuntu via the "apt-get install" command.
Compile the graphlib package in the source code. The compilation follows the standard configure, make process by executing the following commands in sequence:
```
./configure
make
```
Compile the isoinfer package in the source code. The compilation follows the standard configure, make process by executing the following commands in sequence:
```
./configure
make
```

Note that graphlib and isoinfer must be kept under the same directory. The environment variable LD_LIBRARY_PATH must be exported as in the first step every time you open a new shell to run the program. I suggest putting the export line in the .bashrc file in your home directory.

Manual

Usage:

Jobs:

-h	Print help information
-ext_junc_ref	Extract junction ref sequence. In the involved parameters, -bound, -grange, -ref and -read_info are necessary, while -s, -tsspas and -rstart are optional.
-predict	Infer isoforms. In the involved parameters, -bound and -grange are necessary. Optional parameters are -s, -tsspas, -read_info, -conf_level, -intron_exp, -min_exp, -low_exp, -ps, -min_dup and -o.

Options:

Parameter	Range	Default Value	Description
-rstart	number	0	For job -ext_junc_ref, this parameter specifies the position of the first nucleotide of a chromosome. It ensures that the coordinates used in the program are consistent with the coordinates provided by -bound, -grange and -tsspas.
-bound	file	N/A	Boundary file. The format of the file is: 'chromosome strand position type'. Every two consecutive fields are separated by a single TAB. Each line corresponds to a boundary with a certain type. The position is always the position of the first base of an exon or intron. 'type' is binary: type = 0 for intron->exon, type = 1 for exon->intron. A boundary may be both type 0 and type 1; in this case, provide two lines for the boundary, one for type 0 and one for type 1. If type information is unavailable, set the type to 0 for all boundaries.
-grange	file	N/A	Gene range file. The format of the file is: 'gene_name chromosome strand start_position end_position'. Every two consecutive fields are separated by a single TAB. Each line corresponds to a gene.
-tsspas	file	N/A	TSS and PAS file. The format of the file is: 'gene_name TSSs PASs'. Every two consecutive fields are separated by a single TAB. TSSs and PASs are each separated by commas. In each line, an isoform starting from one element in TSSs must end with some element in PASs on the same line. Multiple lines may be provided for a gene, and there is no constraint between different lines. This parameter is optional. When it is set, make sure that the gene names are consistent with the gene names in the gene range file (specified by -grange).
-ref	file	N/A	Reference sequence in a single file.
-read_info	file	N/A	A file storing the basic read information. Each read info consists of three lines: 'mapping_file 0/1', then '[end_len] cross_strength noise_level total_read_cnt distribution_type', then 'definition_of_a_distribution'. The format of the mapping_file is: 'chromosome strand start_positions end_positions'. Every two consecutive fields are separated by a single TAB. Each line corresponds to the mapping information of a read. Each read can be mapped to multiple segments of the reference sequence. 'start_positions' ('end_positions') are the start (end) positions of all the segments, separated by commas. The start/end positions should be sorted in non-decreasing order. This format is similar to the BED format of UCSC. For example, suppose a paired-end read with end length 50 is mapped to the RefSeq: the first end is mapped to segments [300,330) and [700,720) on the positive strand of chromosome chr1, and the second end is mapped to segments [1300,1315) and [2000,2035) on the positive strand of chromosome chr1. Then the mapping file should contain two consecutive lines for this paired-end read: 'chr1 + 300,700 330,720' and 'chr1 + 1300,2000 1315,2035'. The second item on the first line of a read info indicates whether the reads are paired-end: set 1 for paired-end and 0 otherwise. If it is 1, the first item on the second line is the length of each end of the paired-end reads; the two ends of a paired-end read are assumed to have the same length. If the reads are not paired-end, the second line starts with cross_strength. cross_strength is the minimum number of base pairs of agreement between a read and an exon for the read to be considered partially aligned to the exon. This parameter influences everything related to junctions. noise_level is the estimated noise level in RPKM. For job -predict, this parameter is used to reduce noise. In the RNA-seq experiment conducted in (Nat Methods. 2008 Jul;5(7):621-8), about 7% of short reads are mapped to introns and intergenic regions, while introns and intergenic regions comprise more than 98% of the entire mouse genome. Given that the size of the mouse genome is about 2.5 Gb, the noise level can be estimated as less than 0.03 RPKM. total_read_cnt should be the total number of successfully mapped short reads. This number is not known before the reads are actually loaded by IsoInfer, so put an arbitrary number here initially; the program will update it automatically. definition_of_a_distribution describes the length distribution for single-end reads or the span distribution for paired-end reads. Currently, three types of distributions are supported. 0, Constant: definition_of_a_distribution should be 'the_constant'. 1, Gaussian: definition_of_a_distribution should be 'mean standard_deviation'. 2, Customized: definition_of_a_distribution should be 'length_cnt length probability length probability ...'. One read info may be followed by another in the same file. For job -ext_junc_ref, only the first read info in the file is effective. If a job does not use all the information in a read info, the unused items can be set to any value.
-s	T/F	F	Whether the operations are strand-specific.
-min_exp	number	0	The minimum expression level in RPKM. By default, this parameter has no influence on the result. The greater the value, the higher the sensitivity and the lower the precision of the result. For example, this parameter could be set to 1.
-intron_exp	number	3	For job -predict, a segment with an expression level below this parameter is considered an intron. Selecting this value carefully is critical. If it is 0, all segments are considered expressed, which introduces noise segments into the subsequent isoform predictions. If it is too high, many expressed segments are considered introns, which lowers the sensitivity. Sequence bias, noise, multireads and inaccuracy of exon-intron boundaries should all be taken into account when setting -intron_exp. In our tests, 3 to 5 is a reasonable range for this parameter.
-low_exp	number	0	Genes with expression levels (RPKM) below this parameter are filtered out. A larger value for this parameter leads to better precision.
-min_dup	number	1	A junction is covered if at least -min_dup reads cover this junction.
-ps	number	7	Partition size. On the whole mouse genome, the isoform inference process (the "Predict isoforms" step in the example below) takes about 10 minutes on a standard PC with this default value. A larger value is expected to lead to better results.
-conf_level	number in [0,1]	0.05	Set the confidence level.
-o	file	N/A	Output file. For job -ext_junc_ref, each junction forms two consecutive lines in the file. The odd line is the junction ID, in the form 'chromosome|position1|position2|cross_len|Junc'. (Schematic defining position1 and position2 for a junction.) The even line is the concatenation of the sequences [position1, position1+cross_len) and [position2, position2+cross_len). For job -predict, each line in the output is a predicted isoform in a format similar to UCSC known genes: 'ID chromosome strand start_position end_position exon_start_positions exon_end_positions exp_in_rpkm'. The last column is the estimated expression level of the predicted isoform in RPKM.
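
The noise-level estimate quoted in the -read_info description can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name and arguments are hypothetical, and it uses the usual definition of RPKM (reads per kilobase per million mapped reads) together with the rounded figures from the text (7% of reads, 98% of a 2.5 Gb genome).

```python
def noise_rpkm(noise_read_fraction, noise_region_fraction, genome_size_bp):
    """Estimate the RPKM of reads that fall in introns and intergenic regions.

    RPKM = reads_in_region * 1e9 / (total_reads * region_length_bp).
    Using a fraction of reads instead of a count cancels total_reads.
    """
    region_length_bp = noise_region_fraction * genome_size_bp
    return noise_read_fraction * 1e9 / region_length_bp


# Figures taken from the text: 7% of reads, 98% of a 2.5 Gb genome.
estimate = noise_rpkm(0.07, 0.98, 2.5e9)
print(f"{estimate:.4f}")
```

The result is slightly below 0.03 RPKM, which matches the bound given above.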

Examples

The following example is based on single-end short reads. It provides an example read_info file and several useful scripts. The usage of each script is straightforward; please read the script itself for details.

Generate files needed by -bound, -grange and -tsspas

Use the script knownGeneExtractor to extract all three files from a knownGene table downloaded from UCSC.
```
./knownGeneExtractor knownGene
```
Use the script BoundfromTopHat to extract the file needed by -bound from the junctions predicted by TopHat.
```
./BoundfromTopHat junctions.bed
```
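If you need to build a boundary file from your own annotation instead, the format described for -bound can be written out directly. The following is a minimal sketch, not one of the provided scripts: the function name is hypothetical, and it assumes exons are given as half-open intervals [start, end), so that `end` is the first base of the following intron. Whether the outermost transcript ends should be listed as boundaries is not stated in this document, so check the result against the output of the provided scripts.

```python
def boundary_lines(chrom, strand, exons):
    """Return boundary-file lines for a list of half-open exons (start, end).

    type 0 marks intron->exon (first base of an exon),
    type 1 marks exon->intron (first base of the following intron).
    """
    records = set()
    for start, end in exons:
        records.add((start, 0))  # first base of the exon
        records.add((end, 1))    # first base after the exon (assumption: half-open)
    # A position that is both type 0 and type 1 yields two lines, as required.
    return [f"{chrom}\t{strand}\t{pos}\t{btype}" for pos, btype in sorted(records)]


lines = boundary_lines("chr1", "+", [(100, 200), (300, 400)])
print("\n".join(lines))
```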

Generate a read_info file

If you have 10000000 single-end reads with a read length of 50bp, the read mapping file is "your_mapped_reads", the noise level is 0.1 RPKM and cross_strength is 2, the content of the read_info file should be:
```
your_mapped_reads 0
2 0.1 10000000 0
50
```
If you have 30000000 paired-end reads with an end length of 30bp, the read mapping file is "your_mapped_reads", the noise level is 0.1 RPKM, cross_strength is 2 and the span (the gap plus twice the end length) of the paired-end reads follows the normal distribution N(200,10^2), the content of the read_info file should be:
```
your_mapped_reads 1
30 2 0.1 30000000 1
200 10
```
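The two read_info files above follow the same three-line layout, so they can be generated programmatically. The sketch below is one possible helper; the function name and its arguments are hypothetical, and it assumes that fields are separated by single spaces as in the examples.

```python
def read_info_text(mapping_file, cross_strength, noise_level, total_read_cnt,
                   dist_type, dist_params, end_len=None):
    """Build the three lines of one read info entry.

    end_len is given only for paired-end reads; it switches the flag on the
    first line to 1 and is written as the first item on the second line.
    """
    paired = end_len is not None
    line1 = f"{mapping_file} {1 if paired else 0}"
    fields = ([end_len] if paired else []) + [
        cross_strength, noise_level, total_read_cnt, dist_type]
    line2 = " ".join(str(f) for f in fields)
    line3 = " ".join(str(p) for p in dist_params)
    return "\n".join([line1, line2, line3]) + "\n"


# Single-end: constant read length 50 (distribution type 0).
single = read_info_text("your_mapped_reads", 2, 0.1, 10000000, 0, [50])
# Paired-end: end length 30, Gaussian span N(200, 10^2) (distribution type 1).
paired = read_info_text("your_mapped_reads", 2, 0.1, 30000000, 1, [200, 10],
                        end_len=30)
print(single)
print(paired)
```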

Generate a read mapping file

First, extract junction sequences non-strand-specifically using IsoInfer:
```
isoinfer -ext_junc_ref -s F -rstart 0 -bound Bound -grange GeneRange -tsspas TSSPAS -ref refseq -read_info read_info -o juncref
```
Note that the knownGene table (if it is used) and the reference sequence you downloaded should be consistent. Only the read length and cross strength in the read_info file are used in this step.
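To check the junction file, it can help to decode a junction ID of the form chromosome|position1|position2|cross_len|Junc and rebuild its sequence from the reference. The sketch below is illustrative only: the function names are hypothetical, it assumes that chromosome names contain no '|' character, and it assumes that -rstart is simply the coordinate of the first base of the chromosome string.

```python
def parse_junction_id(junction_id):
    """Split 'chromosome|position1|position2|cross_len|Junc' into its parts."""
    chrom, pos1, pos2, cross_len, tag = junction_id.split("|")
    if tag != "Junc":
        raise ValueError(f"not a junction ID: {junction_id!r}")
    return chrom, int(pos1), int(pos2), int(cross_len)


def junction_sequence(chrom_seq, pos1, pos2, cross_len, rstart=0):
    """Concatenate [pos1, pos1+cross_len) and [pos2, pos2+cross_len)."""
    i, j = pos1 - rstart, pos2 - rstart  # convert coordinates to string indices
    return chrom_seq[i:i + cross_len] + chrom_seq[j:j + cross_len]


# Toy reference and made-up junction ID, for illustration only.
toy_ref = "AAAACCCCGGGGTTTT"
chrom, p1, p2, n = parse_junction_id("chr1|2|10|3|Junc")
print(chrom, junction_sequence(toy_ref, p1, p2, n))
```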
Second, use Bowtie to map single-end short reads to the reference sequence and junction sequences. You can use the script tranMappedRefReads to extract the mapping information of reads to the reference sequence from the default output of bowtie. You can use the script tranMappedJuncReads to extract the mapping information of reads to junction sequences from the default output of bowtie. Then put the output of these two scripts together, e.g. into file "mapped_reads".
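If you write mapping lines yourself rather than using the provided scripts, each read (or each end of a paired-end read) becomes one TAB-separated line. The following sketch reproduces the paired-end example from the -read_info description; the function name is hypothetical and segments are assumed to be half-open intervals [start, end).

```python
def mapping_line(chrom, strand, segments):
    """Format one read as 'chromosome strand start_positions end_positions'."""
    segments = sorted(segments)  # positions must be in non-decreasing order
    starts = ",".join(str(start) for start, _ in segments)
    ends = ",".join(str(end) for _, end in segments)
    return f"{chrom}\t{strand}\t{starts}\t{ends}"


# The two ends of the paired-end read form two consecutive lines.
first_end = mapping_line("chr1", "+", [(300, 330), (700, 720)])
second_end = mapping_line("chr1", "+", [(1300, 1315), (2000, 2035)])
print(first_end)
print(second_end)
```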

Predict isoforms

Set -min_exp to 0.1. All segments with expression level below 3 RPKM are considered as introns. A junction is covered if it is covered by at least 2 reads. Set all other parameters to default.
```
isoinfer -predict -bound Bound -grange GeneRange -tsspas TSSPAS -intron_exp 3 \
 -min_dup 2 -min_exp 0.1 -read_info read_info -o results
```

If the TSS/PAS information is missing, omit -tsspas:
```
isoinfer -predict -bound Bound -grange GeneRange -read_info read_info -o results
```
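
The results file can then be read back for downstream analysis. The sketch below parses one line of the -predict output as described for -o. It rests on assumptions not stated in this document: that the eight fields are separated by whitespace, and that the exon start and end positions are comma-separated lists (a trailing comma, as in UCSC tables, is tolerated). The names `Isoform` and `parse_isoform` and the example line are made up for illustration.

```python
from typing import List, NamedTuple


class Isoform(NamedTuple):
    iso_id: str
    chrom: str
    strand: str
    start: int
    end: int
    exon_starts: List[int]
    exon_ends: List[int]
    rpkm: float


def parse_isoform(line: str) -> Isoform:
    """Parse one predicted isoform line (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    iso_id, chrom, strand, start, end, starts, ends, rpkm = fields

    def to_ints(text: str) -> List[int]:
        # Skip empty items so a trailing comma does not cause an error.
        return [int(x) for x in text.split(",") if x]

    return Isoform(iso_id, chrom, strand, int(start), int(end),
                   to_ints(starts), to_ints(ends), float(rpkm))


# Hypothetical output line, not taken from a real run.
example = "iso1\tchr1\t+\t100\t400\t100,300\t200,400\t12.5"
iso = parse_isoform(example)
print(iso.exon_starts, iso.exon_ends, iso.rpkm)
```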