CS 234, Winter 2012: Computational Methods for the Analysis of Biomolecular Data

An impressive wealth of data has been amassed by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course focuses on a selection of computational problems in automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

12:40 p.m. - 02:00 p.m. INTS 2134

Office hours

Open door policy or by appointment (email me)

Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • Space-efficient data structures for sequences
  • Short read mapping (suffix trees, suffix arrays, BWT)
  • Sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs
Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones, Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]
  • Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]
  • Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
  • Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
  • Avak Kahvejian, John Quackenbush & John F Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
  • Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]
Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology)
  • Slides [PDF Format 2slides/page] (Mol Biology Tools)
  • Slides [PDF Format 2slides/page] (Statistics for Sequence Analysis)
  • Slides [PDF Format 2slides/page] (HMMs)
  • Slides [PDF Format 2slides/page] (Indexing and Searching)
  • Slides [PDF Format 2slides/page] (Bio Networks)
Resources

  • RNAi animation (Nature Genetics)
  • The inner life of a Cell
  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Primer on Molecular Genetics
  • PMP Resources
Projects

  • Project ideas and rules
  • Yousra's CS 234 project webpage
  • Matt's CS 234 project webpage
  • Tanzirul's CS 234 webpage
  • Steve's CS 234 project webpage
  • Scott's CS 234 project webpage
  • Nurjahan's CS 234 project webpage
  • Dave's CS 234 webpage
  • Ron-Michael's CS 234 webpage
  • Sean's CS 234 webpage
  • Mike's CS 234 webpage
  • Yi-Wen's CS 234 webpage
  • Jie's CS 234 webpage
Homework

  • Homework 1 (posted Jan 17, due Jan 31)
  • Homework 2 (posted Feb 2, due Feb 16)
  • Homework 2 solution (by Y. Wu)
  • Homework 3 (posted Feb 21, due Mar 6)
Midterm

  • Midterm 2009 (posted Feb 2)
Presentation

  • choose a slot 1-12 below and send me your choice
  • choose a paper from the Proceedings of RECOMB 2011 or ISMB/ECCB 2011 and send me the title
  • send me the PowerPoint file the day before the presentation (before 5pm)
  • give the 18-minute presentation (time it carefully; I will stop you after 18 minutes)
Calendar of Lectures

  • Jan 10: Intro, Molecular Biology
  • Jan 12: Molecular Biology
  • Jan 17: Molecular Biology [hw1 posted]
  • Jan 19: Molecular Biology
  • Jan 24: Molecular Biology Tools
  • Jan 26: Molecular Biology Tools
  • Jan 31: Statistics and Probability [hw1 due]
  • Feb  2: Statistics and Probability [hw2 posted]
  • Feb  7: Statistics and Probability
  • Feb  9: Statistics and Probability, HMMs
  • Feb 14: Guest Lecture
  • Feb 16: Guest Lecture [hw2 due]
  • Feb 21: HMMs [hw3 posted]
  • Feb 23: HMMs, Mapping
  • Feb 28: Mapping
  • Mar  1: Mapping
  • Mar  6: MIDTERM (in class, closed books, closed notes) [hw3 due]
  • Mar  8: Presentations. (deadline for the PPT file is Mar 7, 5PM)
    1: Matt (A hybrid approach to extract protein-protein interactions, Bioinformatics 2011)
    2: Tanzirul (Hapsembler: An Assembler for Highly Polymorphic Genomes, RECOMB'11)
    3: Ron-Michael (Predicting site-specific human selective pressure using evolutionary signatures, ISMB'11)
    4: Yi-Wen (IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly)
  • Mar 13: Presentations. (deadline for the PPT file is Mar 12, 5PM)
    5: Steve (Physical Module Networks: an integrative approach for reconstructing transcription regulation, ISMB'11)
    6: Nurjahan (Learning Cellular Sorting Pathways Using Protein Interactions and Sequence Motifs, RECOMB'11)
    7: Dave (Blocked Pattern Matching Problem and Its Applications in Proteomics, RECOMB'11)
    8: Yousra (An enhanced Petri-net model to predict synergistic effects of pairwise drug combinations from gene microarray data, ISMB'11)
  • Mar 15: Presentations. (deadline for the PPT file is Mar 14, 5PM)
    9: Scott (Tanglegrams for rooted phylogenetic trees and networks, ISMB'11)
    10: Jie (MeSH: a window into full text for document summarization, ISMB'11)
    11: Mike (Automatic 3D neuron tracing using all-paths pruning, ISMB 2011)
    12: Sean (Identifying Branched Metabolic Pathways by Merging Linear Metabolic Pathways, RECOMB 2011)
Project Demo (20-25 minute demo, 5-10 minutes for questions, in my office; please bring your laptop)

  • Monday, March 19
    3:00 Scott
  • Wednesday, March 21
    10:30 Sean
    11:00 Ron-Michael
    11:30 Yi-Wen
    1:30 Tanzirul
    2:00 Yousra
  • Thursday, March 22
    11:00 Mike
    11:30 Jie
    1:00 Dave
    1:30 Nurjahan
    2:00 Matt
    2:30 Steve