CS 234, Fall 2005: Computational Methods for the Analysis of Biomolecular Data

A staggering wealth of data has been generated by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

TR, 9:40am-11am, Sproul 2356

Office hours

TF, 11:10am-12noon, Engineering II 317

Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • combinatorial algorithms and statistical methods for pattern discovery and sequence alignment
  • sequence alignment and hidden Markov models (HMM)
  • analysis of 2D time series data (gene expression data)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • analysis of 3D structures (protein structure) [time permitting]
  • protein-protein interaction graphs
  • protein structure
  • the H-P model
  • ab initio and comparative modeling
Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and the background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".


  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Pierre Baldi and Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.
  • João Setubal and João Carlos Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Co., 1997.
  • Jason Wang, Bruce A. Shapiro, and Dennis Shasha, Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications, Oxford University Press, 1999.
  • David Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2002.
  • Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens and Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
Papers

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten. "Finding Patterns in Biological Sequences". Unpublished TR. University of Waterloo, 2000 [PDF format]
  • Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]
  • Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]
  • Todd Moon, "The Expectation-Maximization Algorithm", IEEE Signal Processing Magazine, Nov 1996 [PDF format]
  • Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and HMM", UC Berkeley, TR-97-021 [PDF format]
  • C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment", Science 262, 1993 [PDF Format]
  • Jun S. Liu, Andrew F. Neuwald, Charles E. Lawrence, "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies", Journal of the American Statistical Association, 90(432), 1995 [PDF Format]
  • W. Huber, A. von Heydebreck, M. Vingron, "Analysis of microarray gene expression data", in Martin Bishop et al. (editors), Handbook of Statistical Genetics, 2nd Edition, John Wiley & Sons, Ltd., Chichester, UK, 2003 [PDF format]
Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology)
  • Slides [PDF Format 2slides/page] (Some basic probability)
  • Slides [PDF Format 2slides/page] (Intro to Pattern Discovery)
  • Slides [PDF Format 2slides/page] (Pattern Discovery)
  • Slides [PDF Format 2slides/page] (HMM)
  • Slides [PDF Format 2slides/page] (Microarrays)
Resources

  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Primer on Molecular Genetics
  • Daily news about bioinformatics
  • PMP Resources
Projects

  • Ryan Rusich's CS234 webpage
  • Vladimir Vacic's CS234 webpage
  • Huseyin Hakkoymaz's CS234 webpage
  • San Nguyen's CS234 webpage
  • Vi Pham's CS234 webpage
  • Antony Ming Lam's CS234 webpage
  • Yao Ma's CS234 webpage
  • Yonghui Wu's CS234 webpage
  • Mikiko Matsunaga's CS234 webpage
  • Van Le-Pham's CS234 webpage
  • Chun-Chih Wu's CS234 webpage
  • Kan Liu's CS234 webpage
  • Wei Yu's CS234 webpage
  • Jianfeng (Jeff) Yang's CS234 webpage
  • Sam Meshkin's CS234 webpage
  • Cao Yiqun's CS234 webpage
Homework

  • Homework 1 (posted Oct 4, due Oct 18)
  • Homework 2 (posted Oct 18, due Nov 1st)
  • Homework 2: Solution by Yonghui Wu (posted Nov 3rd)
  • Homework 3 (posted Nov 3, due Nov 17)
  • Homework 4 (posted Nov 17, due Dec 1st)
Presentation

  • choose a slot 1-16 below and send me your preference
  • choose a paper from the RECOMB 2005 or ISMB 2005 proceedings and send the title to me
  • send the PowerPoint file to me the day before the presentation (before 5pm)
  • give the 15-minute presentation (make sure you time it correctly; I will stop you after 15 minutes)
Calendar of Lectures

  • Sep 29: Intro, Molecular Biology [slides 1-26]
  • Oct   4: Molecular Biology [slides 27-52]
  • Oct   6: Molecular Biology [slides 53-80]
  • Oct 11: Molecular Biology [slides 80-end]
  • Oct 13: Intro to Probability [slides 1-22]
  • Oct 18: Intro to Probability [slides 22-end], Intro to Pattern Discovery [1-17] (HW1 due)
  • Oct 20: Intro to Pattern Discovery [18-52]
  • Oct 25: Intro to Pattern Discovery [53-end]
  • Oct 27: Discovery of Rigid Patterns [1-43]
  • Nov   1: Discovery of Rigid Patterns [44-75] (HW2 due)
  • Nov   3: Discovery of Rigid Patterns [75-end], HMM [1-9]
  • Nov   8: MIDTERM (closed books, closed notes)
  • Nov 10: HMM [10-43]
  • Nov 15: HMM [44-end], Microarrays [1-10]
  • Nov 17: Microarrays [10-52] (HW3 due)
  • Nov 22: Microarrays [52-end], BioNetworks [1-end]
  • Nov 24: Thanksgiving

  • Nov 29: Presentations. (deadline for the PPT file is Nov 28th, 5PM)
    1: Sam (A Linear-Time Algorithm for the Perfect ..., RECOMB05)
    2: Ryan (Predicting protein-protein interaction ..., ISMB05)
    3: Kan (Automatic detection of subsystem/pathway ..., ISMB05)
    4: Yao (PILER: identification and classification of ..., ISMB05)

  • Dec   1: Presentations. (deadline for the PPT file is Nov 30th, 5PM) (HW4 due)
    5: Mikiko (Improving protein structure prediction ..., ISMB05)
    6: Yonghui (Pairwise Local Alignment of Protein ..., RECOMB05)
    7: Vi (A Polynomial Time Solvable Formulation ..., RECOMB05)
    8: Van (Clustering Short Time Series Gene ..., ISMB05)

  • Dec   6: Presentations. (deadline for the PPT file is Dec 5th, 5PM)
    9: Huseyin (High-recall protein entity recognition ..., ISMB05)
    10: Wei (Reversal Distance for Partially ..., ISMB05)
    11: Antony (Three-Stage Prediction of Protein Beta-Sheets ..., ISMB05)
    12: Chun-Chih (De novo identification of repeat ..., ISMB05)

  • Dec   8: Presentations. (deadline for the PPT file is Dec 7th, 5PM)
    13: San (GenRate - A Generative Model that Finds ..., RECOMB05)
    14: Vlada (Motif-based protein ranking ..., ISMB05)
    15: Yiqun (ExonHunter: a comprehensive approach ..., ISMB05)
    16: Jeff (Protein function prediction via graph kernels ..., ISMB05)
Project Demo (in my office; please bring your laptop)

  • Dec 7: 9:30 Vi
              10:00 Kan
              10:30 San
              11:00 Huseyin
              11:30 Van
  • Dec 8: 2:00 Antony
              2:30 Yao
              3:00 Sam
              3:30 Yonghui
              4:00 Yiqun
              4:30 Chun-Chih
  • Dec 9: 9:30 Ryan
              10:00 Jeff
              10:30 Wei
              11:00 Vlada
              11:30 Mikiko