IsoInfer0.5

  1. Introduction
  2. IsoInfer is a C/C++ program that infers isoforms from short RNA-Seq reads (single-end and paired-end), exon-intron boundaries, and TSS/PAS information. This version of IsoInfer handles short reads of different types and lengths in a unified way. The source code is provided for non-commercial use. We apologize that Windows and Mac versions of this program are not available at present.


  3. Compilation
  4. The source code contains two packages: graphlib and IsoInfer. To compile the source code, you should

    Note that graphlib and isoinfer must be kept under the same directory. The environment variable LD_LIBRARY_PATH should be exported, as in the first step, every time you open a new shell to run the program. We suggest putting the export in the .bashrc file in your home directory.


  5. Manual (Under Construction)
  6. Usage:

        IsoInfer <Job> <Options>

    Jobs:

    -h Print help information
    -ext_junc_ref Extract junction ref sequence. -rstart, -bound, -grange, -tsspas, -ref, -read_info are required.
    -gen_instance Generate instances of problem for IsoInfer. Expression level will be used to define expressed segments. A segment is expressed if the expression level on this segment is above the expression level specified by -noise. -bound, -grange, -tsspas and -read_info are required.
    -predict Infer isoforms from the instances generated by -gen_instance. -ins, -conf_level, -min_exp, -min_dup, -ps, -bpe, -bse are required.

    Options:


    -rstart number For job -ext_junc_ref, this parameter specifies the start position of the first nucleotide of a chromosome. It ensures that the coordinates used in the program are consistent with the coordinates provided by -bound, -grange and -tsspas. Default 0.
    -bound file Boundary file. The format of the file is :
          chromosome strand position type
    • Every two consecutive fields are separated by a single TAB.
    • Each line corresponds to a boundary with a certain type.
    • The position is always the position of the first base of an exon or intron.
    • 'type' is binary: type = 0 for intron->exon, type = 1 for exon->intron.
    • It is possible that a boundary is both type 0 and type 1. In this case, provide two lines for this boundary, with one line for type 0 and another line for type 1.
    • If type information is unavailable, set types as 0 for all the boundaries.
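
    The boundary format above can be checked with a small parser. This is a minimal sketch, not part of IsoInfer: the names Boundary and parse_boundary_line are hypothetical, and the strand symbol "+" in the usage line is an assumption, since the manual does not say how strands are written.

```python
from typing import NamedTuple


class Boundary(NamedTuple):
    chromosome: str
    strand: str
    position: int   # first base of the exon or intron
    type: int       # 0 = intron->exon, 1 = exon->intron


def parse_boundary_line(line):
    """Parse one line of a -bound file (four TAB-separated fields)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 TAB-separated fields, got {len(fields)}")
    chromosome, strand, position, btype = fields
    btype = int(btype)
    if btype not in (0, 1):
        raise ValueError(f"type must be 0 or 1, got {btype}")
    return Boundary(chromosome, strand, int(position), btype)


# Hypothetical example line; the strand notation is an assumption.
example = parse_boundary_line("chr1\t+\t1000\t0\n")
print(example)
```

    A boundary that is both type 0 and type 1 simply appears as two lines, so it parses into two Boundary records.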
    -grange file Gene range file. The format of the file is :
          gene_name chromosome strand start_position end_position
    • Every two consecutive fields are separated by a single TAB.
    • Each line corresponds to a gene.
    -tsspas file TSS and PAS file. The format of the file is :
          gene_name TSSs PASs
    • Every two consecutive fields are separated by a single TAB.
    • gene_name should be consistent with the gene range file (specified by -grange).
    • TSSs or PASs are separated by commas. In each line, an isoform starting from one element in TSSs must end with some element in PASs on the same line.
    • For a gene, multiple lines may be provided. There is no constraint on different lines.
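
    To illustrate the pairing rule above, the following sketch lists the (TSS, PAS) combinations each line permits and merges them per gene. The function name allowed_tss_pas_pairs and the sample gene and coordinates are hypothetical; how IsoInfer applies these pairs internally is not described here.

```python
from itertools import product


def allowed_tss_pas_pairs(lines):
    """Map each gene name to the set of (TSS, PAS) pairs its lines allow.

    Within one line, any TSS may be combined with any PAS on that line.
    Different lines for the same gene are independent, so their pairs
    are merged without cross-combining.
    """
    pairs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_name, tss_field, pas_field = line.split("\t")
        tss = [int(x) for x in tss_field.split(",") if x]
        pas = [int(x) for x in pas_field.split(",") if x]
        pairs.setdefault(gene_name, set()).update(product(tss, pas))
    return pairs


# Hypothetical two-line entry for one gene.
demo = allowed_tss_pas_pairs([
    "geneA\t100,200\t900\n",
    "geneA\t300\t950,990\n",
])
print(sorted(demo["geneA"]))
```

    In this example the pair (100, 950) is not allowed, because 100 and 950 never appear on the same line.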
    -ref file Reference sequence in a single file.
    -m file A file containing the mapping information of short reads to the ref sequence. The format of this file is:
          chromosome strand start_positions end_positions
    • Every two consecutive fields are separated by a single TAB.
    • Each line corresponds to the mapping information of a read. Each read could be mapped to multiple segments of the reference sequence. 'start_positions' ('end_positions') are all the start (end) positions, separated by commas, of all the segments. This format is similar to the BED format of UCSC.
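
    A mapping line can be split into its segments as sketched below. The function name parse_mapping_line and the sample coordinates are hypothetical. The manual does not state whether end positions are inclusive, so the sketch only pairs each start with its end and makes no assumption about that.

```python
def parse_mapping_line(line):
    """Parse one mapping line into (chromosome, strand, segments).

    segments is a list of (start, end) tuples, one per mapped segment,
    in the order they are listed on the line.
    """
    chromosome, strand, starts, ends = line.rstrip("\n").split("\t")
    start_list = [int(s) for s in starts.split(",") if s]
    end_list = [int(e) for e in ends.split(",") if e]
    if len(start_list) != len(end_list):
        raise ValueError("start_positions and end_positions differ in length")
    return chromosome, strand, list(zip(start_list, end_list))


# Hypothetical read mapped to two segments, e.g. across a junction.
print(parse_mapping_line("chr1\t+\t100,500\t120,505\n"))
```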
    -read_info file A file storing the basic read information. The format of this file is:
          mapping_file 0/1
          [end_len] cross_strength noise_level total_read_cnt distribution_type
          "definition of a distribution"
    The format for the mapping_file is :
          chromosome strand start_positions end_positions
    • Every two consecutive fields are separated by a single TAB.
    • Each line corresponds to the mapping information of a read. Each read could be mapped to multiple segments of the reference sequence. 'start_positions' ('end_positions') are all the start (end) positions, separated by commas, of all the segments. This format is similar to the BED format of UCSC.
    The second term in a read info file indicates whether the reads are paired-end. Set it to 1 for paired-end reads and to 0 otherwise. If it is 1, the first item on the following line is the length of each end of a paired-end read; the two ends of a paired-end read are assumed to have the same length. If the reads are not paired-end, the following line starts with "cross_strength". Currently, three types of distributions are supported:
    • 0 Constant: The "definition of a distribution" should be
            the_constant
    • 1 Gaussian distribution: The "definition of a distribution" should be
            mean standard_deviation
    • 2 Customized: The "definition of a distribution" should be
            value_cnt
            value probability
            value probability
            ...
    For example, suppose the reads are paired-end with span distribution N(300, 30^2) and end length 20, the mapping file is "my_map_file", the cross strength is 3, the noise level is 5 RPKM, and the total number of reads is 10M. Then the read info file should be:
          my_map_file 1
          20 3 5 10000000 1
          300, 30
    A read info file may contain several read infos, one after another. For job -ext_junc_ref, only the first read info in the file is used. If a job does not use all of the information in a read info, the unused items can be set to any value.
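
    A read info can also be generated programmatically. The sketch below reproduces the example above; render_read_info is a hypothetical helper, not part of IsoInfer. It assumes fields on the first two lines are separated by single spaces, as they appear in the example, and takes the "definition of a distribution" as lines already formatted by the caller.

```python
def render_read_info(mapping_file, cross_strength, noise_level,
                     total_read_cnt, distribution_type, distribution_lines,
                     end_len=None):
    """Return the text of one read info.

    Pass end_len for paired-end reads; leave it as None for single-end
    reads, in which case the second line starts with cross_strength.
    """
    paired = end_len is not None
    first = f"{mapping_file} {1 if paired else 0}"
    fields = [cross_strength, noise_level, total_read_cnt, distribution_type]
    if paired:
        fields.insert(0, end_len)
    second = " ".join(str(f) for f in fields)
    return "\n".join([first, second, *distribution_lines]) + "\n"


# The paired-end example from the manual: Gaussian span (type 1).
text = render_read_info("my_map_file", 3, 5, 10_000_000, 1,
                        ["300, 30"], end_len=20)
print(text, end="")
```

    Several read infos can be placed in one file by concatenating the returned strings.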
    -s T/F Whether the operations are strand-specific. Default F.
    -ins file A file containing instances.
    -bse T/F Use the TSS/PAS information or not.
    -min_exp number The minimum expression level. Default 0.
    -min_dup number The minimum effective duplication of part comb. Default 1. This parameter is effective when paired-end reads are available.
    -ps number Partition size. Default 7. On the whole mouse genome, the isoform inference process (the -predict step in the following example) takes about 10 minutes on a standard PC with this default parameter. A larger value is expected to give better results.
    -noise number The noise level in RPKM. When doing job -gen_instance, a segment with expression level below the number specified in this parameter will be considered as an intron. Default 0
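
    As a rough illustration of the noise threshold, the sketch below uses the standard RPKM definition (reads per kilobase of segment per million mapped reads) and the rule that a segment is expressed only if its level is above the -noise value. The function names are hypothetical, and how IsoInfer itself counts reads on a segment is not specified in this manual, so treat the numbers as an example only.

```python
def rpkm(read_count, segment_length, total_read_cnt):
    """Standard RPKM: reads per kilobase per million mapped reads."""
    return read_count * 1e9 / (segment_length * total_read_cnt)


def is_expressed(read_count, segment_length, total_read_cnt, noise):
    """A segment is expressed if its RPKM is strictly above the noise level."""
    return rpkm(read_count, segment_length, total_read_cnt) > noise


# Hypothetical segment: 1000 bases long, 10M mapped reads in total.
print(rpkm(50, 1000, 10_000_000))
print(is_expressed(50, 1000, 10_000_000, noise=5))
print(is_expressed(51, 1000, 10_000_000, noise=5))
```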
    -conf_level number in [0,1] Set the confidence level. Default 0.05.
    -o file A file for output
    • For job -ext_junc_ref, each junction forms two consecutive lines in the file. The odd line is the junction ID which is in the form of :
            >chromosome|position1|position2|cross_len|Junc
      The following schematic graph defines position1 and position2 for a junction

      The even line is the concatenation of sequences [position1, position1+cross_len) and [position2, position2+cross_len)
    • For job -gen_instance, it is OK to treat the output as a black box :-)
    • For job -predict, each line in the output is a predicted isoform in a format similar to UCSC known genes:
      ID chromosome strand start_position end_position exon_start_positions exon_end_positions
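
    The -ext_junc_ref output described above can be illustrated with a short sketch that builds one junction record from a reference string. The function junction_record and the toy sequence are hypothetical. The sketch assumes that a position maps to string index position - rstart, which follows the description of -rstart as the position of the first nucleotide of a chromosome.

```python
def junction_record(chromosome, ref_seq, position1, position2, cross_len,
                    rstart=0):
    """Build the two lines of one junction record.

    The sequence line concatenates [position1, position1 + cross_len)
    and [position2, position2 + cross_len) of the chromosome sequence.
    """
    i1 = position1 - rstart
    i2 = position2 - rstart
    header = f">{chromosome}|{position1}|{position2}|{cross_len}|Junc"
    sequence = ref_seq[i1:i1 + cross_len] + ref_seq[i2:i2 + cross_len]
    return header, sequence


# Toy chromosome sequence, for illustration only.
ref = "AAAACCCCGGGGTTTT"
header, sequence = junction_record("chr1", ref, 2, 10, 3)
print(header)
print(sequence)
```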



  7. Example: (Under Construction)
  8. The following example is based on single-end short reads. An example read_info file and several useful scripts are provided with it. For security reasons, the ".pl" suffix has been removed from all the scripts. The usage of each script is straightforward; please read the script itself for details.

    1. Use a script knownGeneExtractor to extract the required files needed by -bound -grange and -tsspas from a knownGene table downloaded from UCSC

            $./knownGeneExtractor knownGene


    2. Modify the example read_info file as needed.


    3. Extract junction reference sequence non-strand-specifically

            $isoinfer -ext_junc_ref -s F -rstart 0 -bound Bound -grange GeneRange -tsspas TSSPAS -ref refseq -read_info read_info -o juncref

      Note that the known gene table and the reference sequence you downloaded should be consistent. Only the read length and cross strength in the read_info file are used in this step.

    4. Use Bowtie to map the short reads to the reference sequence and junction sequences. You can use the script tranMappedRefReads to extract the mapping information of reads to the reference sequence from the default output of bowtie. You can use the script tranMappedJuncReads to extract the mapping information of reads to junction sequences from the default output of bowtie. Then put the output of these two scripts together, e.g. into file "mapped_reads".


    5. Modify the "mapping_file" and "total_read_cnt" entries in the example read_info file. The "mapping_file" should be "mapped_reads" and "total_read_cnt" should be the total number of mapped short reads, including those mapped to the reference sequence and those mapped to the junction sequences.
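
      One way to obtain "total_read_cnt" is to count the lines of the combined mapping file, since each line holds the mapping information of one read. This is a minimal sketch: count_mapped_reads is a hypothetical helper, and it assumes every read appears on exactly one line of "mapped_reads".

```python
def count_mapped_reads(path):
    """Count non-empty lines in a mapping file (one line per mapped read)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


# Usage, assuming the combined file from step 4 is named "mapped_reads":
# total_read_cnt = count_mapped_reads("mapped_reads")
```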


    6. Generate instances for IsoInfer.

            $isoinfer -gen_instance -bound Bound -grange GeneRange -tsspas TSSPAS -read_info read_info -o my_instances

    7. Predict isoforms given input 'my_instances'. Set -min_exp to 1 and leave all other parameters at their defaults.

            $isoinfer -predict -ins my_instances -bse T -bs2 T -min_exp 1 -oformat 2 -read_info read_info -o results


  9. Reference
  10. Jianxing Feng, Wei Li and Tao Jiang. Inference of isoforms from short sequence reads. 2010. Accepted by RECOMB 2010.


  11. Questions and feedback
  12. Please email to:

    jianxing

    TA

    cs.ucr.edu