CS 234, Winter 2009: Computational Methods for the Analysis of Biomolecular Data

A staggering wealth of data has been generated by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

12:40 p.m. - 2:00 p.m., Engineering II, room 141

Office hours

By appointment; please email me.

Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • combinatorial algorithms and statistical methods for pattern discovery and sequence alignment
  • sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and the background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences - Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Pierre Baldi, Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.
  • Joao Setubal and Joao Carlos Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Co., 1997.
  • Jason Wang, Bruce A. Shapiro, and Dennis Shasha, Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications, Oxford University Press, 1999.
  • David Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2002.
  • Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.

References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten. "Finding Patterns in Biological Sequences". Unpublished TR. University of Waterloo, 2000 [PDF format]
  • Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]
  • Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]
  • Todd Moon, "The Expectation-Maximization Algorithm", IEEE Signal Processing Magazine, Nov 1996 [PDF Format]
  • Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and HMM", UC Berkeley, TR-97-021 [PDF Format]
  • C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment", Science 262, 1993 [PDF Format]
  • Jun S. Liu, Andrew F. Neuwald, Charles E. Lawrence, "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies", Journal of the American Statistical Association, 90(432), 1995 [PDF Format]

Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology)
  • Slides [PDF Format 2slides/page] (Some basic probability)
  • Slides [PDF Format 2slides/page] (Intro to Pattern Discovery)
  • Slides [PDF Format 2slides/page] (Discovery of Rigid Patterns)
  • Slides [PDF Format 2slides/page] (HMM)
  • Slides [PDF Format 2slides/page] (Microarrays)
  • Slides [PDF Format 2slides/page] (Biological networks)

Resources

  • The inner life of a Cell
  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Primer on Molecular Genetics
  • PMP Resources

Projects

  • Project ideas and rules
  • David Cohen's CS234 webpage
  • Doxa Chatzopoulou CS234 webpage
  • Kamal Adusumilli CS234 webpage
  • Astro David Ashtaralnakhai CS234 webpage
  • Wei Li CS234 webpage
  • Bao Ngo CS234 webpage
  • Steve Cole CS234 webpage
  • Md Reaz Uddin CS234 webpage
  • Ben Smith CS234 webpage
  • Sima Lotfi CS234 webpage

Homework

  • Homework 1 (posted Jan 13, due Jan 27)
  • Homework 2 (posted Jan 29, due Feb 12)
  • Homework 3 (posted Feb 12, due Feb 26)

Presentation

  • choose a slot (1-10) from the calendar below and send me your choice
  • choose a paper from the RECOMB 2008 or ISMB 2008 proceedings and send me the title
  • send me the PowerPoint file the day before the presentation (before 5pm)
  • give the 15-minute presentation (make sure you time it correctly; I will stop you after 15 minutes)

Calendar of Lectures

  • Jan   6: Intro, Molecular Biology
  • Jan   8: Molecular Biology
  • Jan 13: Molecular Biology [hw1 posted]
  • Jan 15: Molecular Biology
  • Jan 20: Statistics and Probability
  • Jan 22: Pattern Discovery
  • Jan 27: Pattern Discovery [hw1 due]
  • Jan 29: Pattern Discovery [hw2 posted]
  • Feb   3: Pattern Discovery
  • Feb   5: Pattern Discovery
  • Feb 10: HMM
  • Feb 12: HMM [hw2 due, hw3 posted]
  • Feb 17: MIDTERM (in class, closed books, closed notes)
  • Feb 19: Microarrays
  • Feb 24: Microarrays
  • Feb 26: Networks [hw3 due]
  • Mar 3: Networks
  • Mar 5: Presentations. (deadline for the PPT file is Mar 4, 5PM)
    1: Sima Lotfi (Estimating true evolutionary distances under the DCJ model, ISMB 2008)
    2: Doxa Chatzopoulou (The EXACT description of biomedical protocols, ISMB 2008)
  • Mar 10: Presentations. (deadline for the PPT file is Mar 9, 5PM)
    3: David Cohen (More Efficient Algorithms for Closest String and Substring Problems, RECOMB 2008)
    4: Keerthi Kamal Adusumilli (Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space, ISMB 2008)
    5: Steve Cole (MicroRNA prediction with a novel ranking algorithm based on random walks, ISMB 2008)
    6: Md Reaz Uddin (Protein complex identification by supervised graph local clustering, ISMB 2008)
  • Mar 12: Presentations. (deadline for the PPT file is Mar 11, 5PM)
    7: Wei Li (Ab Initio Whole Genome Shotgun Assembly with Mated Short Reads, RECOMB 2008)
    8: Bao Ngo (A distance metric for a class of tree-sibling phylogenetic networks, )
    9: Ben Smith (CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads, RECOMB 2008)
    10: Astro David Ashtaralnakhai (Selecting anti-HIV therapies based on a variety of genomic and clinical factors, ISMB 2008)

Project Demo (in my office; please bring your laptop)

  • Wed, Mar 18
    11:00 Sima
    11:30 Wei Li
    1:30 David
    2:00 Kamal
  • Thu, Mar 19
    11:00 Md Reaz Uddin
    11:30 Gloria
    1:30 Ben
    2:00 Astro