CS 234, Winter 2011: Computational Methods for the Analysis of Biomolecular Data

A staggering wealth of data has been amassed by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

12:40 p.m. - 2:00 p.m., Engineering II, Room 141

Office hours

By appointment; please email me.

Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • Space-efficient data structures for sequences
  • Short read mapping
  • Sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Pierre Baldi and Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.
  • Joao Setubal and Joao Carlos Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Co., 1997.
  • Jason Wang, Bruce A. Shapiro, and Dennis Shasha, Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications, Oxford University Press, 1999.
  • David Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2002.
  • Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens and Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.

References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten. "Finding Patterns in Biological Sequences". Unpublished TR. University of Waterloo, 2000 [PDF format]
  • Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]
  • Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]
  • Todd Moon, "The Expectation-Maximization Algorithm", IEEE Signal Processing Magazine, Nov 1996 [PDF Format]
  • Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and HMM", UC Berkeley, TR-97-021 [PDF Format]
  • C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment", Science 262, 1993 [PDF Format]
  • Jun S. Liu, Andrew F. Neuwald, Charles E. Lawrence, "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies", Journal of the American Statistical Association, 90(432), 1995 [PDF Format]

Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology) (updated Jan 24th)
  • Slides [PDF Format 2slides/page] (Sequence Analysis)
  • Slides [PDF Format 2slides/page] (Indexing and Searching)
  • Slides [PDF Format 2slides/page] (Bio Networks)

Resources

  • The inner life of a Cell
  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Primer on Molecular Genetics
  • PMP Resources

Projects

  • Project ideas and rules
  • Ergude Bao's CS234 webpage
  • Shaofang Li's CS234 webpage
  • Jaehoon Kim's CS234 webpage
  • Mohammad Shokoohi Yekta's CS234 webpage
  • Elena Strzheletska's CS234 webpage
  • Mohammad Khorramzadeh's CS234 webpage
  • Seyed Mirebrahim's CS234 webpage
  • Mo Cao's CS234 webpage
  • Luan Nguyen's CS234 webpage

Homework

  • Homework 1 (posted Jan 11, due Jan 25)
  • Homework 2 (posted Jan 27, due Feb 10)
  • Homework 2 solution (by Y. Wu)
  • Homework 3 (posted Feb 15, due Mar 1)

Midterm

  • Midterm 2009 (posted Feb 10)

Presentation

  • choose a slot 1-10 below and send me your choice
  • choose a paper from the proceedings of RECOMB 2010 or ISMB 2010 and send me the title
  • send me the PowerPoint file the day before the presentation (before 5pm)
  • give the 20-minute presentation (make sure you time it correctly; I will stop you after 20 minutes)

Calendar of Lectures

  • Jan   4: Intro, Molecular Biology
  • Jan   6: Molecular Biology
  • Jan 11: Molecular Biology [hw1 posted]
  • Jan 13: Molecular Biology
  • Jan 18: Molecular Biology
  • Jan 20: Molecular Biology
  • Jan 25: Statistical Analysis of Sequences [hw1 due]
  • Jan 27: Statistical Analysis of Sequences [hw2 posted]
  • Feb  1: Statistical Analysis of Sequences
  • Feb  3: Statistical Analysis of Sequences
  • Feb  8: Statistical Analysis of Sequences
  • Feb 10: Indexing for sequences [hw2 due]
  • Feb 15: CANCELLED [hw3 posted]
  • Feb 17: Indexing for sequences
  • Feb 22: Indexing for sequences
  • Feb 24: MIDTERM (in class, closed book, closed notes)
  • Mar  1: Biological Networks [hw3 due]
  • Mar  3: Presentations. (deadline for the PPT file is Mar 2, 5PM)
    1: Jaehoon (LCE: a link-based cluster ensemble method for improved gene)
    2: Luan (Discovering Regulatory Overlapping RNA Transcripts)
    3: Mohammad Sh (Hierarchical Generative Biclustering for MicroRNA Expression Analysis)
  • Mar  8: Presentations. (deadline for the PPT file is Mar 7, 5PM)
    4: Shaofang (Seed Design Framework for Mapping SOLiD Reads, RECOMB)
    5: Hamid (Predicting Nucleosome Positioning Using Multiple Evidence Tracks, RECOMB)
    6: Mohammad Kh (Leveraging Sequence Classification by Taxonomy-Based Multitask Learning)
  • Mar 10: Presentations. (deadline for the PPT file is Mar 9, 5PM)
    7: Bao (IDBA: A Practical Iterative de Bruijn Graph De Novo Assembler, RECOMB)
    8: Mo (Algorithms for detecting Significantly Mutated Pathways in Cancer)
    9: Elena (An Algorithmic Framework for Predicting Side-Effects of Drugs, RECOMB 2010)

Project Demo (20-25 minute demo, 5-10 minutes for questions, in my office; please bring your laptop)

  • Wed, Mar 16
    2:00 Mohammad Sh
    2:30 Jaehoon
    3:00 Elena
    3:30 Mo
  • Thu, Mar 17
    10:00 Shaofang
    10:30 Hamid
    11:00 Luan
    11:30 Bao
    1:30 Mohammad Kh