CS 234, Winter 2016: Project ideas

Below are some ideas for the CS 234 projects. Most of them require that you first become familiar with the biological problem they address (i.e., by reading papers or books).
Rules:
1. Projects are individual (unless you get permission from me to work on a team on a bigger project).
2. If you would like to propose a different project idea that is relevant to the course, feel free to do so; however, original ideas must be discussed with and approved by the instructor.
3. You can use any language you deem appropriate for the project (e.g., C++, Java, Python, Perl, Ruby, MATLAB) and any library that may be helpful (provided that it does not directly solve the problem you are addressing). Projects that require strict time efficiency should use a compiled language (C/C++).
4. You will have to post a progress report (bi-weekly) on your webpage.
5. During finals week, I will ask each of you to give me a half-hour demo of your project in my office.
6. You cannot reuse a project you did for any other class at UCR.
7. The difficulty of the project will be taken into consideration in the final grade.

1. Shredding and mapping short reads. Design and implement two simple tools: the first takes a chromosome as input and generates n simulated reads of length l, possibly introducing artificial sequencing errors at a 1% rate; the second takes the chromosome and the reads as input and tries to map the reads back to the chromosome. Reads that map to exactly one location of the genome ('unique') will be mapped, whereas reads that map to more than one location ('ambiguous') will be discarded. Your implementation must be time- and space-efficient. Collect data on the number of unique reads as a function of the value of l and the chosen genome. Collect data on the time and memory used for mapping the reads.
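
As a starting point, the two tools could be sketched as follows. This is a minimal sketch, not a reference solution: the function names, the uniform sampling of start positions, the substitution-only error model, and the exact-match mapper built on a dictionary of l-mers are all assumptions made for illustration. In particular, the mapper ignores sequencing errors and is far too memory-hungry for a real chromosome.

```python
import random
from collections import defaultdict

def simulate_reads(chromosome, n, l, error_rate=0.01, seed=None):
    """Sample n reads of length l from uniformly chosen start positions.

    Each base is replaced by a different base with probability error_rate
    (substitutions only; insertions and deletions are not modeled).
    Returns a list of (true_start, read_sequence) pairs.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n):
        start = rng.randrange(len(chromosome) - l + 1)
        bases = list(chromosome[start:start + l])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads

def map_exact(chromosome, reads, l):
    """Naive baseline: index every l-mer, keep reads with exactly one hit.

    Returns {read_index: position} for the 'unique' reads only.
    """
    index = defaultdict(list)
    for i in range(len(chromosome) - l + 1):
        index[chromosome[i:i + l]].append(i)
    unique = {}
    for read_id, (_, seq) in enumerate(reads):
        hits = index.get(seq, [])
        if len(hits) == 1:
            unique[read_id] = hits[0]
    return unique
```

Keeping the true start position with each simulated read makes it easy to check later whether a read was mapped to the right place.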

2. Metagenomics binning. Collect a set of m bacterial genomes from GenBank, then generate n simulated reads of length l, possibly introducing artificial sequencing errors at a 1% rate. The mix of all the reads together is the input to your tool, along with the parameter m. The objective of this project is to implement a method to separate the reads into m bins (corresponding to the original genomes) as accurately as possible. The simplest method is to use the distribution of k-mers in each read, typically 4-mers. Represent the count of occurrences of each of the 4^4 = 256 possible 4-mers in the read as a 256-dimensional vector, then use a clustering algorithm on these vectors to decide where to assign the reads (e.g., k-means where k=m).
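
The k-mer count vector of a read could be computed along these lines. This is only a sketch: the function name, the lexicographic ordering of the k-mers, and the choice to skip windows containing characters other than A, C, G, T (such as N) are assumptions. The vector has 4^k entries.

```python
from itertools import product

def kmer_counts(read, k=4):
    """Count the occurrences of every k-mer over ACGT in the read.

    Entry i of the result corresponds to the i-th k-mer in lexicographic
    order (AAAA, AAAC, AAAG, ... for k=4). Windows that contain any other
    character are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:
            counts[j] += 1
    return counts
```

Before clustering, you may want to divide each vector by its sum so that reads of different lengths are comparable.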

3. Motif discovery. Implement the "random projection" algorithm for motif finding described in Buhler and Tompa, Finding Motifs Using Random Projections, Proc. RECOMB, 67-74, 2001. Run the program and collect experimental data. How would you improve its performance?

4. Suffix trees. Design a "generalized suffix tree" C++ class. Design an algorithm/program to find the maximal unique matches between two long strings (ideally chromosomes). Your implementation must be space-efficient. Collect data on the time and space used for different input sizes. See Gusfield's book for a description of the generalized suffix tree.

5. Suffix Arrays. Design a "suffix array" C++ class. Design an algorithm/program to find the maximal unique matches between two long strings (ideally chromosomes). Your implementation must be space-efficient. Collect data on the time and space used for different input sizes.
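
To fix ideas, here is what a suffix array is and how it answers pattern queries, sketched in Python for brevity (the project itself asks for a C++ class). The function names are hypothetical, and the construction is deliberately naive: sorting suffixes by direct comparison is quadratic or worse and would not satisfy the space-efficiency requirement on a chromosome. Finding maximal unique matches is not shown.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction: compares whole suffixes, so it is only usable
    on short strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    All suffixes that start with pattern are contiguous in the suffix
    array, so two binary searches delimit the block.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

For example, with text = "banana", the call find_occurrences(text, build_suffix_array(text), "ana") looks up every position at which "ana" starts.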

6. Splicing site recognition. Carry out the project described on the following page (courtesy of Forbes Burkowski, University of Waterloo).

7. Gene expression analysis. Carry out the project described on the following page (courtesy of Yizong Cheng, University of Cincinnati). Test other measures of similarity between time series (instead of the correlation coefficient) and compare the results. Note: compared to the others, this is an "easy" project; unless you do something "creative" with the analysis, I will take its lower difficulty into consideration in the grade.
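
To illustrate what "other measures of similarity" might mean, the sketch below places the Pearson correlation coefficient next to the Euclidean distance. The choice of Euclidean distance as the alternative is only an example, and the function names are hypothetical; the sketch assumes two series of equal length, neither of which is constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

The two measures can disagree: the series [1, 2, 3] and [2, 4, 6] are perfectly correlated, yet their Euclidean distance is not zero, because correlation ignores differences in scale.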

8. Biclustering gene expression data. Implement the "random projection" algorithm for biclustering gene expression data described in the following paper. Run the program and collect experimental data. How would you improve its performance?

9. Motif discovery. Implement the "Gibbs sampling" algorithm for motif finding described in Lawrence et al., "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment", Science, 262, 208-214, 1993 (also described in the slides). Run the program and collect experimental data. How would you improve its performance?
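
The core loop of a Gibbs site sampler can be sketched as follows. This is a simplified illustration and not the full procedure of Lawrence et al.: it assumes exactly one motif occurrence per sequence, a uniform background, a pseudocount of 1, sequences over A, C, G, T only, and a fixed number of sweeps. The function name and parameters are hypothetical.

```python
import random

def gibbs_motif_search(seqs, w, sweeps=200, seed=0):
    """Return one motif start position per sequence for motif width w."""
    rng = random.Random(seed)
    # Start from random positions.
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(sweeps):
        for held in range(len(seqs)):
            # Build a profile from the current sites of all other sequences.
            counts = [{b: 1.0 for b in "ACGT"} for _ in range(w)]
            for k, other in enumerate(seqs):
                if k != held:
                    for j in range(w):
                        counts[j][other[pos[k] + j]] += 1
            total = len(seqs) - 1 + 4  # sites counted plus pseudocounts
            # Score every window of the held-out sequence under the profile.
            target = seqs[held]
            weights = []
            for start in range(len(target) - w + 1):
                p = 1.0
                for j in range(w):
                    p *= counts[j][target[start + j]] / total
                weights.append(p)
            # Sample the new site in proportion to its score.
            pos[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos
```

Because the new position is sampled rather than chosen greedily, different seeds can give different answers; running several restarts and keeping the best-scoring alignment is a common remedy.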

10. Benchmarking motif discovery software. Compare the most common motif discovery algorithms to one another (MEME, consensus, motif sampler, pratt, weeder, projection, etc.). Design a benchmark data set and collect experimental results. Which one is the best, and in which circumstances?

11. Benchmarking gene recognition software. Compare the most common programs for gene prediction to one another (GRAIL, GENSCAN, GLIMMER, VEIL, EVIGAN, etc.; a list of tools can be found here). Design a benchmark data set and collect experimental results. Which one is the best, and in which circumstances?

13. Protein-Protein interaction graph analysis. Write a program to read the edge lists for the fly and yeast protein-protein interaction graphs. Then run a comparative analysis of the two graphs. Compute network diameter, clustering coefficient, degree centrality, betweenness centrality, etc. See this Wikipedia entry for more information on distances in graph theory. Can you draw any conclusions from the comparative analysis of these two graphs?
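
Two of these measures could be computed along the following lines. This is a sketch under assumptions: the edge list is taken to be one whitespace-separated pair of protein names per line, the graph is treated as undirected and unweighted, self-interactions are ignored, and the diameter is taken over reachable pairs only (real interaction graphs are usually disconnected). The function names are hypothetical, and the all-pairs BFS will be slow on large graphs.

```python
from collections import defaultdict, deque

def read_edge_list(lines):
    """Build an undirected adjacency map from 'proteinA proteinB' lines."""
    adj = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or parts[0] == parts[1]:
            continue  # skip malformed lines and self-interactions
        u, v = parts[0], parts[1]
        adj[u].add(v)
        adj[v].add(u)
    return adj

def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Longest shortest path over all pairs of mutually reachable nodes."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def local_clustering(adj, u):
    """Fraction of pairs of neighbors of u that are themselves connected."""
    nbrs = list(adj[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))
```

The clustering coefficient of the whole graph can then be reported as the average of local_clustering over all nodes.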

14. PDB viewer. Design a program that parses a PDB file (from the PDB database) and draws a simple picture of the protein, using small disks to represent atoms and joining consecutive atoms with lines. Color atoms contained in alpha-helices red, atoms contained in beta-sheets blue, and all other atoms black. Run the viewer on 2BOP, 1RNB, 1CD8 and two other proteins that you find interesting (courtesy of Daniel Huson).
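
The parsing and coloring steps might look like the sketch below (drawing is left out). PDB files use fixed-width columns; the column positions used here for the ATOM, HELIX and SHEET records are given to the best of my reading of the PDB file format description, so please verify them against the official format guide before relying on them. The sketch also assumes that each helix or strand stays within one chain and ignores insertion codes and HETATM records. The function names are hypothetical.

```python
def parse_pdb(lines):
    """Extract atoms and secondary-structure residue ranges from PDB lines."""
    atoms, helices, sheets = [], [], []
    for line in lines:
        record = line[:6].strip()
        if record == "ATOM":
            atoms.append({
                "name": line[12:16].strip(),   # columns 13-16
                "res_name": line[17:20],       # columns 18-20
                "chain": line[21],             # column 22
                "res_seq": int(line[22:26]),   # columns 23-26
                "x": float(line[30:38]),       # columns 31-38
                "y": float(line[38:46]),       # columns 39-46
                "z": float(line[46:54]),       # columns 47-54
            })
        elif record == "HELIX":
            # (chain, first residue number, last residue number)
            helices.append((line[19], int(line[21:25]), int(line[33:37])))
        elif record == "SHEET":
            sheets.append((line[21], int(line[22:26]), int(line[33:37])))
    return atoms, helices, sheets

def atom_color(atom, helices, sheets):
    """Red for helix atoms, blue for sheet atoms, black otherwise."""
    def inside(ranges):
        return any(chain == atom["chain"] and first <= atom["res_seq"] <= last
                   for chain, first, last in ranges)
    if inside(helices):
        return "red"
    if inside(sheets):
        return "blue"
    return "black"
```

A simple picture can then be obtained by projecting the coordinates onto the x-y plane and drawing one disk per atom in the color returned by atom_color.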

15. Baum-Welch. Implement the Baum-Welch algorithm for HMMs described in class. Train the HMM on a biologically relevant dataset (e.g., TF binding sites), and test it on a chromosome. Collect experimental time measurements.
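
One re-estimation step of Baum-Welch for a discrete HMM can be sketched as follows. The notation (pi for initial probabilities, A for transitions, B for emissions) and the function names are assumptions and may differ from those used in class. The sketch handles a single observation sequence of integer symbols and uses unscaled forward and backward values, so it will underflow on long sequences such as a chromosome; a real implementation needs scaling or log-space arithmetic.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(obs[0..t], state at time t is i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(obs[t+1..] | state at time t is i)."""
    n = len(A)
    beta = [[1.0] * n]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """Return re-estimated (pi, A, B) and the likelihood of obs under the inputs."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    new_pi = gamma[0][:]
    new_A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        visits = sum(gamma[t][i] for t in range(T - 1))
        for j in range(n):
            # Expected number of i -> j transitions.
            moves = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                        for t in range(T - 1)) / likelihood
            new_A[i][j] = moves / visits
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(m)]
             for i in range(n)]
    return new_pi, new_A, new_B, likelihood
```

Training repeats this step until the likelihood stops improving; since each step cannot decrease the likelihood, a drop is a useful sign of a bug.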