CS 234

Hello, I'm Chih-Ming Yen.
My project is "Benchmarking motif discovery software."

References: "Assessing computational tools for the discovery of transcription factor binding sites", "Improved benchmarks for computational motif discovery"
Website: University of Washington, Computer Science & Engineering

<3/1>
Decided on the project "Benchmarking motif discovery software."
<3/2>
I searched for motif discovery software and tried to understand it.
<3/7>
I tried to install MEME and the Gibbs motif sampler.
MEME has to run on Linux, so I installed VMware to run Linux under Windows.
But when I installed MEME, some hardware problems occurred, so I turned to installing the motif sampler first.
When I installed the motif sampler, hardware problems happened again, so I tried another way: installing the motif sampler under Windows.
I installed Cygwin to run the motif sampler, but the full Cygwin installation takes at least 4 GB and I did not have enough disk space. Then I found that the online versions of both programs are good enough. However, the motif sampler sends its results by email, which sometimes takes a long time and sometimes never arrives.
<3/8>
I found other software such as Consensus, Improbizer, SeSiMCMC, and Weeder, and I tried all of them to decide which to use. Every time I ran Consensus, it caused my computer to crash. SeSiMCMC has result notification pages, but I could never access the result pages. I could not get into the Weeder web page.
In the end, I decided to use MEME, the Gibbs motif sampler, and Improbizer.
<3/9>
I found the papers "Assessing computational tools for the discovery of transcription factor binding sites" and "Improved benchmarks for computational motif discovery". I read them and tried to understand them.
<3/14>
I kept trying to understand the papers and thought about how to split the gene sequences into subsequences.
<3/15>
I collected 5 yeast genes from NCBI and found their binding sites in the TRANSFAC database.
<3/16>
I selected the binding sites defined in the paper and planted them in another randomly selected yeast gene sequence at the same locations.
Then I ran the resulting gene sequence through MEME and Improbizer.
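The planting step can be sketched roughly as below. This is only a minimal illustration in Python, assuming that "planting" means overwriting the bases of the background sequence at the same position with the binding site. The function name plant_site and the example sequences are made up for illustration; they are not real yeast or TRANSFAC data.

```python
def plant_site(background: str, site: str, position: int) -> str:
    """Overwrite len(site) bases of `background`, starting at `position`, with `site`.

    Hypothetical helper: assumes planting replaces bases in place, so the
    sequence length does not change.
    """
    if position < 0 or position + len(site) > len(background):
        raise ValueError("site does not fit in the background sequence")
    return background[:position] + site + background[position + len(site):]


# Made-up example data (not real yeast sequence, not a real TRANSFAC site).
background = "ACGTACGTACGTACGTACGT"
site = "TTGACA"
planted = plant_site(background, site, 5)
print(planted)
```

Because the site replaces bases instead of being inserted, the planted sequence keeps the length of the background sequence, so site coordinates from the original gene still apply.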
<3/17>
I found some datasets that have already been benchmarked, and I chose 15 of them (5 human, 5 mouse, and 5 yeast) to run predictions with MEME, Improbizer, and the Gibbs motif sampler.
<3/18>
I found that in the already benchmarked datasets, all the subsequences have the same length, and the motif positions are spread out over different places within the subsequences. But in the subsequences I made, the motif positions all cluster at the end of each subsequence.
I still can't figure out how to split the sequence into small subsequences.
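One possible way to get subsequences of equal length with the motif at varying positions is sketched below. This is only an assumption about how it could be done, not the method used in the benchmark papers: for each binding site, cut a fixed-length window that contains the whole site, with the window start chosen at random. The function name window_around_site and the example values are made up for illustration.

```python
import random


def window_around_site(sequence, site_start, site_len, window_len, rng):
    """Return (window, offset): a fixed-length window that contains the site,
    and the position of the site inside that window.

    Hypothetical helper: the window start is drawn uniformly from all starts
    that keep the whole site inside the window and the window inside the sequence.
    """
    if site_start < 0 or site_start + site_len > len(sequence):
        raise ValueError("site lies outside the sequence")
    if site_len > window_len or window_len > len(sequence):
        raise ValueError("window length must fit the site and the sequence")
    lowest = max(0, site_start + site_len - window_len)
    highest = min(site_start, len(sequence) - window_len)
    window_start = rng.randint(lowest, highest)  # inclusive on both ends
    window = sequence[window_start:window_start + window_len]
    return window, site_start - window_start


# Made-up example: a 100-base sequence with a 6-base site starting at position 40.
sequence = "ACGT" * 25
rng = random.Random(0)  # fixed seed so the run is repeatable
window, offset = window_around_site(sequence, 40, 6, 20, rng)
print(len(window), offset)
```

With this approach every subsequence has the same length, and the site offset changes from window to window instead of always falling at the end. Sites very close to either end of the gene still have fewer possible offsets.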

My e-mail is cyen@cs.ucr.edu
If you have any questions, just email me.
Want to know more about me?
Please follow this link: http://www.cs.ucr.edu/~cyen