As a fundamental tool for discovering genes involved in a disease or biological process, differential gene expression analysis plays an important role in genomics research. High throughput sequencing technologies, e.g., RNA-Seq, are increasingly being used for differential gene expression analysis which was dominated by the microarray technology in the past decade. However, inferring differential gene expression based on the observed difference of RNA-Seq read counts has unique challenges that were not present in microarray-based analysis. The differential expression estimation may be biased against low read count values such that the differential expression of genes with high read counts is more easily detected. The estimation bias may further propagate in downstream analyses at the systems biology level if it is not corrected.
To obtain a better inference of differential gene expression, we have proposed a new efficient algorithm based on a Markov random field (MRF) model, called MRFSeq, that uses additional gene coexpression data to enhance the prediction power. Our main technical contribution is the careful selection of the clique potential functions in the MRF so its maximum a posteriori (MAP) estimation of can be reduced to the well-known maximum flow problem and thus solved in polynomial time. By using incorporating the loopy belief propagation technique, MRFSeq is able to not only call differentially expressed (DE) genes but also assign confidence scores to each inferred DE gene. Our extensive experiments on simulated and real RNA-Seq datasets demonstrate that MRFSeq is more accurate and less biased against genes with low read counts than the existing methods based on RNA-Seq data alone. For example, on the well-studied MAQC dataset, MRFSeq improved the sensitivity from 11.6% to 38.8% for genes with low read counts.
This program is developed and tested on Linux and by g++ 4.4.5
1. Install COXPRESSdb (http://coxpresdb.jp/).
2. Install R and DESeq (http://www-huber.embl.de/users/anders/DESeq/)
3. Install MRFSEQ
The installation is super easy by following the steps below:
$ tar zfxv <the tar ball you just download, e.g., mrfseq_0.1.tar.gz>
$ cd mrfseq_0.1
After this, a binary file MRFSEQ will be created in the directory where you unzipped the source code. To use MRFSEQ, you need specify the name of the input read counts file and the path where COXPRESSdb is installed.
./MRFSEQ <the path to COXPRESSdb > <the input read count file>
MRFSeq needs two types of files as the input: (1) a read count file and (2) co-expression database files. For (1), we include two read count files, MAQC.txt and alexa.txt, in the tar ball of MRFSeq as examples to demonstrate how to execute MRFSeq.
Format of Read Count Files
Assume we have n genes in our dataset and there are m replicates of 2 biological samples. The input read count file consists of n rows and each row contains 2m+1 fields which are delimited by a tab \t. The first field should be the Entrez ID of genes followed by m read counts in one of the two biological samples and m read counts in the other sample.
10 0 0 0 0 0 1 0 0 1 0 0 0 0 1
100 8 10 15 10 13 14 10 275 269 296 259 273 257 279
10024 0 0 0 0 0 1 1 86 91 121 111 112 102 103
100505765 1 1 1 1 1 1 1 1 1 1 1 1 1 1
. . . .
Format of Co-Expression Database Files
MRFSeq uses Hsa.coex.v6 from COXPRESSdb for all the experiments on the MAQC and Griffths datasets. Currently, all user-defined co-expression data should be transformed to the format of COXPRESSdb. For every gene in the read count file, there should be one file, named by the Entrez ID of the gene, under the directory of co-expression database files. Both Hsa.coex.v6 and the detailed file format can be found by the following URL (http://coxpresdb.jp/download.shtml).
How to test whether your DESeq is successfully installed or not?
After installing R and DESeq in your computer, you can imply run
$ Rscript tmp_deseq.R
in the directory where you installed MRFSeq.
If both R and DESeq are successfully installed, an updated file test_deseq.txt will be created in the same directory.
The Output File of MRFSeq
MRFSeq will create a predict.out file to summarize the prediction of DE states. The predict.out file is formed by an n by 3 matrix. The first column shows the Entrez Gene ID of the genes as in the input read counts file. The second column is the predictions of the DE states by using the maximum-flow algorithm, 1 for DE and 0 for EE. The marginal probabilities for the negative prediction (EE) are given in the third column. During the execution, MRFSeq will print the size of the coexpression network.
For the MAQC dataset,
. . .
create the coexpression network of genes ...
the network size (nodes, edges) : 836,2426
. . .
The running time on the MAQC dataset is about 20 minutes on your machines.