CS 234, Winter 2010: Computational Methods for the Analysis of Biomolecular Data

A staggering wealth of data has been amassed by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

12:40 p.m. - 2:00 p.m., Engineering II, Room 139

Office hours

By appointment; please email me.

Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • combinatorial algorithms and statistical methods for pattern discovery and sequence alignment
  • sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms), or equivalent knowledge. Some programming experience is expected. Students should have some background in probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly of a theoretical nature, although some may require programming. The actual format of the course will ultimately depend on the number and the background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Pierre Baldi and Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.
  • Joao Setubal and Joao Carlos Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Co., 1997.
  • Jason Wang, Bruce A. Shapiro, and Dennis Shasha, Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications, Oxford University Press, 1999.
  • David Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2002.
  • Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens and Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.

References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten. "Finding Patterns in Biological Sequences". Unpublished TR. University of Waterloo, 2000 [PDF format]
  • Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]
  • Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]
  • Todd Moon, "The Expectation-Maximization Algorithm", IEEE Signal Processing Magazine, Nov 1996 [PDF Format]
  • Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and HMM", UC Berkeley, TR-97-021 [PDF Format]
  • C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment", Science 262, 1993 [PDF Format]
  • Jun S. Liu, Andrew F. Neuwald, Charles E. Lawrence, "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies", Journal of the American Statistical Association, 90(432), 1995 [PDF Format]

Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology) [updated Jan 26]
  • Slides [PDF Format 2slides/page] (Sequence Analysis)
  • Slides [PDF Format 2slides/page] (Indexing and Searching)
  • Slides [PDF Format 2slides/page] (Bio Networks)
  • Slides [PDF Format 2slides/page] (Microarrays)

Resources

  • The inner life of a Cell
  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Primer on Molecular Genetics
  • PMP Resources

Projects

  • Project ideas and rules
  • Doruk Sart's CS234 webpage
  • Nkenge S. Wheatland's CS234 webpage
  • Alex Edgcomb's CS234 webpage
  • Jianbo Chen's CS234 webpage
  • Busra Celikkaya's CS234 webpage
  • Curtis Yu's CS234 webpage
  • Dan Grissom's CS234 webpage
  • Mohammad Farhan Habib's CS234 webpage
  • Jesin Zakaria's CS234 webpage
  • Sanjay Kulhari's CS234 webpage
  • Md. Mahbub Hasan's CS234 webpage
  • Ali Mirzabeigi's CS234 webpage
  • Zhixing Jin's CS234 webpage
  • Jarod Wen's CS234 webpage
  • David Keith's CS234 webpage
  • Xin Liu's CS234 webpage
  • Thanawin (Art) Rakthanmanon's CS234 webpage
  • Hanlung Lin's CS234 webpage
  • Burair Al-Saihati's CS234 webpage
  • Anton Polishko's CS234 webpage
  • Denise Duma's CS234 webpage
  • Aleksandr Levchuk's CS234 webpage
  • Mehran Kafai's CS234 webpage
  • Olga Tanaseichuk's CS234 webpage
  • Katia Mkrtchyan's CS234 webpage
  • Amirali's CS234 webpage
  • Xiaoqing Jin's CS234 webpage
  • Claire Huang's CS234 webpage

Homework

  • Homework 1 (posted Jan 12, due Jan 26)
  • Homework 2 (posted Jan 28, due Feb 11)
  • Homework 2 solution (by Y. Wu)
  • Homework 3 (posted Feb 16, due Mar 2)

Midterm

  • Midterm 2009 (posted Feb 16)

Final report

  • Final report ideas and rules

Calendar of Lectures

  • Jan 5: Intro, Molecular Biology
  • Jan 7: Molecular Biology
  • Jan 12: Molecular Biology [hw1 posted]
  • Jan 15: Molecular Biology
  • Jan 19: Guest Lecture (Elena Harris)
  • Jan 21: Molecular Biology
  • Jan 26: Statistical Analysis of Sequences [hw1 due]
  • Jan 28: Statistical Analysis of Sequences [hw2 posted]
  • Feb 2: Statistical Analysis of Sequences
  • Feb 4: Statistical Analysis of Sequences
  • Feb 9: Statistical Analysis of Sequences
  • Feb 11: Indexing for sequences [hw2 due]
  • Feb 16: Indexing for sequences [hw3 posted]
  • Feb 18: Indexing for sequences
  • Feb 23: Indexing for sequences
  • Feb 25: MIDTERM (in class, closed books, closed notes)
  • Mar 2: Microarrays [hw3 due]
  • Mar 4: Microarrays
  • Mar 9: Networks
  • Mar 11: Networks

Project Demo (20 minutes, in my office; please bring your laptop)

  • Mon, Mar 15
              10:00 Amirali
              10:20 Alex
              10:40 Han-lung Lin
              11:00 Jian Wen
              11:20 Busra Celikkaya
              11:40 Burair Al-Saihati
  • Tue, Mar 16
              10:00 Thanawin Rakthanmanon
              10:20 Yu-Ting Huang
              10:40 Ali Mirzabeigi
              11:00 Mehran Kafai
              11:20 Doruk Sart
              11:40 Dan Grissom
  • Wed, Mar 17
              10:00 Xin Liu
              10:20 Olga Tanaseichuk
              10:40 David Keith
              11:00 Curtis Yu
              11:20 Nkenge Wheatland
              11:40 Sanjay Kulhari
  • Thu, Mar 18
              10:00 Jesin Zakaria
              10:20 Hasan
              10:40 Xiaoqing Jin
              11:00 Katya
              11:20 Jianbo
              11:40 Mohammad Farhan Habib