CS 234, Fall 2003: Computational Methods for the Analysis of Biomolecular Data

A staggering wealth of data has been generated by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

TR, 9:40am-11am, OLMH 1132

Office hours

TF, 11:10am-12noon, SURGE 320

Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
      • combinatorial algorithms and statistical methods for pattern discovery and sequence alignment
      • sequence alignment and hidden Markov models (HMMs)
  • analysis of 2D time series data (gene expression data)
      • clustering algorithms
      • classification algorithms
      • subspace clustering/bi-clustering
  • analysis of 3D structures (protein structure) [time permitting]
      • protein structure
      • the H-P model
      • ab initio and comparative modeling
Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and the background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".


  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Pierre Baldi, Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.
  • João Setubal and João Carlos Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Co., 1997.
  • Jason Wang, Bruce A. Shapiro, and Dennis Shasha, Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications, Oxford University Press, 1999.
  • David Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2002.
  • Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
Papers

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten. "Finding Patterns in Biological Sequences". Unpublished TR. University of Waterloo, 2000 [PDF format]
  • Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]
  • Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]
  • Todd Moon, "The Expectation-Maximization Algorithm", IEEE Signal Processing Magazine, Nov 1996 [PDF format]
  • Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and HMM", UC Berkeley, TR-97-021 [PDF format]
  • C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment", Science 262, 1993 [PDF Format]
  • Jun S. Liu, Andrew F. Neuwald, Charles E. Lawrence, "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies", Journal of the American Statistical Association, 90(432), 1995 [PDF Format]
  • W. Huber, A. von Heydebreck, M. Vingron, "Analysis of microarray gene expression data", in Martin Bishop et al. (editors), Handbook of Statistical Genetics, 2nd Edition, John Wiley & Sons, Ltd., Chichester, UK, 2003 [PDF format]
Slides (not available anymore)

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology)
  • Slides [PDF Format 2slides/page] (Intro to Prob/Statistics)
  • Slides [PDF Format 2slides/page] (Patterns)
  • Slides [PDF Format 2slides/page] (Deterministic patterns)
  • Slides [PDF Format 2slides/page] (Rigid patterns)
  • Slides [PDF Format 2slides/page] (HMMs)
  • Slides [PDF Format 2slides/page] (EM)
  • Slides [PDF Format 2slides/page] (Microarrays)
Resources

  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Primer on Molecular Genetics
  • Daily news about bioinformatics
  • PMP Resources
Projects (not available anymore)

Projects are individual. If the project involves programming, the student may choose any language, provided that it does not require a commercial compiler or interpreter. Perl, Python, Java, ANSI C/C++, Octave, etc. are OK; Visual C, Visual Basic, and Matlab are not. For these projects it is strictly forbidden to obtain any source code (or portion thereof) from websites, books, friends, etc. The project will require you to ...
  • Set up a publicly accessible web page (deadline Oct 14th, see homework 1)
  • Choose one of the projects from the list of project ideas, or propose an alternative idea, which needs to be discussed with and approved by the instructor (deadline Oct 14th)
  • Post biweekly progress reports on the web page (deadlines Oct 28th, Nov 11th, Nov 25th)
  • Complete the project and submit a report by December 2nd, 2003 (11:59pm, Pacific Time). Send the report by email as a PDF attachment along with the source code (if any).
Homework (not available anymore)

  • Homework 1 (posted Oct 2, due Oct 14)
  • Homework 2 (posted Oct 14, due Oct 23)
  • Homework 3 (posted Oct 23, due Nov 4)
  • Homework 4 (posted Nov 6, due Nov 18)
Calendar

  • Sep 25: Course overview. Intro to molecular biology (proteins, DNA) [slides 1-29]
  • Sep 30: Intro to molecular biology (RNA, transcription, translation) [slides 30-61]
  • Oct   2: Intro to molecular biology (Genome, Molecular biology tools) [slides 62-87]
  • Oct   7: Intro to molecular biology (Molecular biology tools) [slides 87-end]
  • Oct   9: Intro to Statistics/Prob [slides 1-30]
  • Oct 14: Intro to Statistics/Prob [slides 31-end], Patterns [slides 1-10]
  • Oct 16: Patterns [10-50]
  • Oct 21: Patterns [51-end], Deterministic patterns [1-14]
  • Oct 23: Deterministic patterns [15-32]
  • Oct 28: Deterministic patterns [33-end]
  • Oct 30: Rigid patterns [1-24]
  • Nov   4: Midterm (closed book, closed notes, in class)
  • Nov   6: Rigid patterns [25-68]
  • Nov 11: Veterans' Day
  • Nov 13: HMMs [1-36]
  • Nov 18: guest lecture on "genome rearrangement problems" (J. Zheng)
  • Nov 20: HMMs [37-end]
  • Nov 25: Finding Profiles [1-15], Microarrays [1-23]
  • Nov 27: Thanksgiving
  • Dec   2: Microarrays [24-74]
  • Dec   4: Microarrays [75-end]