CS 260: Pattern Discovery in Biosequences

A staggering wealth of data has been generated by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This seminar course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

TR 3:40pm-5pm, PRCE 3374

Office hours

TR 5:10pm-6:10pm, SURGE 320

Preliminary list of topics

Overview of probability and statistics (2 lectures), introduction to molecular and computational biology (3 lectures), pattern discovery and machine learning (2 lectures), enumerative algorithms for pattern discovery (4 lectures), hidden Markov models and other statistical methods (e.g., Gibbs sampler, EM) for pattern discovery (3 lectures). There may be guest lectures and discussion sessions on special problems. The actual selection of topics may be guided by the interests of the participants.


Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, class discussions, and presentations by the students. The actual format will depend on the class size and the background of the students enrolled.
Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be two to three assignments, mostly theoretical in nature, although some may require programming. At the end of the course, students are required to give a presentation on a research topic selected from a list provided by the instructor. Original projects or proposals will be taken into consideration.

Relation to Other Courses

This seminar course is intended to complement the seminar course on "Algorithms in Computational Molecular Biology" previously taught by Prof. T. Jiang, and "CS235: Data Mining Concepts" usually taught by Prof. D. Gunopulos.


  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Pavel A. Pevzner, Computational Molecular Biology: An Algorithmic Approach, MIT Press, 2000.
  • Joao Setubal and Joao Carlos Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Co., 1997.
  • Jason Wang, Bruce A. Shapiro, and Dennis Shasha, Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications, Oxford University Press, 1999.
  • Pierre Baldi and Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.

Download

  • List of topics for presentation [Postscript Format]
  • "Primer on Molecular Genetics" [Link]
  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten. "Finding Patterns in Biological Sequences". Unpublished TR. University of Waterloo, 2000 [PDF format]
  • Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]
  • Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]
  • Todd Moon, "The Expectation-Maximization Algorithm", IEEE Signal Processing Magazine, Nov 1996 [PDF Format]
  • Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and HMM", UC Berkeley, TR-97-021 [PDF Format]
  • C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment", Science 262, 1993 [PDF Format]
  • Jun S. Liu, Andrew F. Neuwald, Charles E. Lawrence, "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies", Journal of the American Statistical Association, 90(432), 1995 [PDF Format]

Slides (not available for download)

  • Slides [PDF Format] (Course Overview)
  • Slides [PDF Format] (Intro to Molecular Biology)
  • Slides [PDF Format] (Some Probability and Statistics)
  • Slides [PDF Format] (Intro to Machine Learning and Pattern Discovery)
  • Slides [PDF Format] (Discovering Deterministic Patterns)
  • Slides [PDF Format] (Discovering Rigid Patterns)
  • Slides [PDF Format] (Statistical Approaches)
  • Slides [PDF Format] (Discovering Profiles)

Resources

  • PMP Resources

Homework (not available for download)

  • Homework 1, due Oct 30th [Postscript Format] [PDF Format]
  • Homework 2, due Dec 6th [Postscript Format] [PDF Format]

Topics

  • Oct 2: Course overview,
    Intro to Molecular Biology (1/3)
  • Oct 4: Intro to Molecular Biology (2/3)
  • Oct 9: Intro to Molecular Biology (3/3),
    Probability and Statistics (1/2)
  • Oct 11: Probability and Statistics (2/2)
  • Oct 16: Intro to Pattern Discovery and Machine Learning (1/2)
  • Oct 18: Intro to Pattern Discovery and Machine Learning (2/2),
    Discovering Deterministic Patterns (1/2)
  • Oct 23: Discovering Deterministic Patterns (2/2)
  • Oct 25: Discovering Rigid Patterns (1/2)
  • Oct 30: Discovering Rigid Patterns (2/2)
  • Nov 1: Statistical Methods (1/2)
  • Nov 6: Statistical Methods (2/2)
  • Nov 8: Discovering Profiles (1/2)
  • Nov 13: Carlotta Domeniconi ("A Classification Approach for Prediction of Target Events in Temporal Sequences")
  • Nov 15: Discovering Profiles (2/2)
  • Nov 20: Li Jia ("Why is AtMYB gene family subjected to the positive selection?")
  • Nov 22: No Class (Thanksgiving)
  • Nov 27: Yan, Kun (promoter prediction)
    Jones, John (projection algorithm)
  • Nov 29: Liu, Zheng (DNA segmentation)
    Zheng, Jie (data compression)
  • Dec 4: Luo, Yu (gene expression analysis)
    Xu, Ying (data compression)
  • Dec 6: Dreier, Derek (splicing sites prediction)
    Lonardi, Stefano (data compression)