IsoInfer V0.8.2

  1. Source Code
  2. The source code contains two packages: graphlib v0.3 IsoInfer v0.8.2

  3. Compilation
  4. To compile the source code, you should
    1. Install the following C/C++ libraries: glpk, gsl and QuadProg++. If you cannot install those packages in the standard system directories but install them at /your/installed/path/ for example, you have to modify the environment variables LD_LIBRARY_PATH and CXXFLAGS by :

      export CXXFLAGS="-g -O3 -I/your/installed/path/include -L/your/installed/path/lib" 
      export LD_LIBRARY_PATH="/your/installed/path/lib:"$LD_LIBRARY_PATH 
      
      If the compiler complains that it cannot find the library even you installed them by default, you should also find the installed path of these libraries and specify the two environment variables above manually. This may happen on QuadProg installed on Ubuntu by "apt-get install" command.
    2. Compile the graphlib package in the source code. The compilation follows the standard configure, make process by executing the following commands in sequence:

      ./configure
      make
      
    3. Compile the isoinfer package in the source code. The compilation follows the standard configure, make process by executing the following commands in sequence:

      ./configure
      make
      

    Note that it is necessary to keep the graphlib and isoinfer under the same directory. Environment variable LD_LIBRARY_PATH should be exported like the first step every time when you open a new shell to run the program. I suggest you putting it in the .bashrc file in your home directory.


  5. Manual
  6. Usage:

        IsoInfer <Job> <Options>

    Jobs:

    -h Print help information
    -ext_junc_ref Extract junction ref sequence. In the involved parameters, -bound, -grange, -ref and -read_info are necessary, while -tsspas and -rstart are optional.
    -predict Infer isoforms. In the involved parameters, -bound and -grange are necessary. Optional parameters are -tsspas, -read_info, -conf_level, -intron_exp, -min_exp, -ps, -min_dup and -o.

    Options:


    Parameter Range Default Value Description
    -rstart number 0 For job -ext_junc_ref, the parameter specifies the start position of the first neocliotide of a chromosome. This parameter is to make sure that the coordinations used in the program is consistent with the coordinations provided by -bound, -grange and -tsspas.
    -bound file N/A Boundary file. The format of the file is :
      chromosome  strand  position  type
    • Every two consecutive fields are separated by a single TAB.
    • Each line corresponds to a boundary with a certain type.
    • The position is always the position of the first base of an exon or intron.
    • 'type' are binary. type = 0 for intron->exon, type = 1 for exon->intron.
    • It is possible that a boundary is both type 0 and type 1. In this case, provide two lines for this boundary, with one line for type 0 and another line for type 1.
    • If type information is unavailable, set types as 0 for all the boundaries.
    -grange file N/A Gene range file. The format of the file is :
      gene_name chromosome  strand  start_position end_position
    • Every two consecutive fields are separated by a single TAB.
    • Each line corresponds to a gene.
    -tsspas file N/A TSS and PAS file. The format of the file is :
      gene_name TSSs PASs
    • Every two consecutive fields are separated by a single TAB.
    • gene_name should be consistent with the gene range file (specified by -grange).
    • TSSs or PASs are sepereted by commas. In each line, an isoform starting from one element in TSSs must end with some element in PASs on the same line.
    • For a gene, multiple lines may be provided. There is no constraint on different lines.
    This parameter is optional.
    -ref file N/A Reference sequence in a single file.
    -read_info file N/A A file storing the basic read information. The format of this file is:
    mapping_file 0/1
    [end_len] cross_strength noise_level total_read_cnt distribution_type
    definition_of_a_distribution
    
    The format for the mapping_file is :
      chromosome strand start_positions end_positions
    • Every two consecutive fields are separated by a single TAB.
    • Each line corresponds to the mapping information of a read. Each read could be mapped to multiple segments of the reference sequence. 'start_positions' ('end_positions) are all the start (end) positions, separated by commas, of all the segments. The start/end positions should be sorted in a non-decreasing order. This format is similar to the BED format of USCS.

    For example, a paired-end read with end length 50 is mapped to the RefSeq. The first end of this paired-end read is mapped to segments: [300,330) and [700,720) on the positive strand of the chromosome chr1  and the second end of this paired-end read is mapped to segments: [1300,1315) and [2000,2035) on the positive strand of the chromosome chr1. Then the mapping file for this paired-end read should contain two consecutive lines:

    chr1 + 300,700 330,720 
    chr1 + 1300,2000 1315,2035 
    

    The second term in a read info file indicates whether the read is paired-end or not. If it is paired-end, set 1 here. Otherwise 0 should be set. If it is 1, then the first item in the following line is the length of each end of the paired-end read. The lengths of the two ends of a paired-end read is supposed to be the same. If the read is not paired-end, then the following line starts from cross_strength.

    In the second line of a read_info file, cross_strength is the minimum number of base pairs of agreement between a read and an exon when the read is considered as partially aligned to the exon. This parameter influence all the stuff related to junctions.

    noise_level is the estimated noise level in RPKM. When doing job -predict, this parameter will be used to reduce noise. In the RNA-seq experiment conducted in (Nat Methods. 2008 Jul;5(7):621-8), about 7% short reads are mapped to introns and intergenic regions, while introns and intergenic regions comprise more than 98% of the entire mouse genome. By taking into consideration of the fact that the size of the mouse genome is about 2.5G, the noise level could be estimated as less than 0.03 RPKM.

    total_read_cnt should be the total number of successfully mapped short reads. This number would not be known before the reads are actually loaded by IsoInfer. Therefore, put an arbitrary number here initially. The program will update this number automatically.

    definition_of_a_distribution describes the length distribution for single-end reads or the span distribution for paired-end reads. Currently, three type of distributions are supported:

    1. Constant: The definition_of_a_distribution should be
        the_constant 
    2. Gaussian distribution: The definition_of_a_distribution should be
        mean standard_deviation 
    3. Customized: The definition_of_a_distribution should be
      length_cnt
      length probability 
      length probability 
      ... 
      

    After one read info, another one could be followed in the same file. On job -ext_junc_ref, only the first read info in the file is effective. If, on some job, not all the information in the read info is usefully, then the unused items can be set to any value.

    -s T/F F Whether are the operations strand specific?
    -min_exp number 0 The minimum expression level in RPKM. By default, this parameter will not have influence on the result. The greater the value is, the higher sensitivity / lower precision the result achieves. For example, this parameter could be set as 1.
    -intron_exp number 3 When doing job -predict, a segment with expression level below this parameter will be considered as an intron. A carefully selected value for this parameter is critical. If this value is 0, then all the segments will considered as expressed, which will introduce noise segments in the following isoform predictions. If this value is too high, many expressed segments will be considered as introns, which will lower the sensitivity. The sequence bias, noise, multireads and inaccuracy of exon-intron boundary should be taken into consideration in the setting of -intron_exp. By our tests, 3~5 is a reasonable value for this parameter.
    -min_dup number 1 A junction is covered if at least -min_dup reads cover this junction.
    -ps number 7 Partition size. On whole mouse genome, the isoform inference process (Step4 in the following example) costs about 10 minutes on a standard PC with this default parameter. A larger value is supposed to lead to better results.
    -conf_level number in [0,1] 0.05 Set the confidence level.
    -o file N/A A file for output
    • For job -ext_junc_ref, each junction forms two consecutive lines in the file. The odd line is the junction ID which is in the form of :

        chromosome|position1|position2|cross_len|Junc

      The following schematic graph defines position1 and position2 for a junction

      The even line is the concatenation of sequences [position1, position1+cross_len) and [position2, position2+cross_len)

    • For job -predict, each line in the output is an predicted isoforms in a format similar to UCSC known genes:

        ID chromosome strand start_position end_position exon_start_positions exon_end_positions



  7. Examples
  8. The following example is based on single-end short reads. In the following example, an example read_info file and several useful scripts are provided. The usages of all the scripts are straightforward. Please read the script for the usages.

    1. Generate files needed by -bound, -grange and -tsspas
    2. Generate a read_info file
    3. Generate a read mapping file
    4. Predict isoforms