CS 234, W16: Computational Methods for the Analysis of Biomolecular Data


  • Homework 3 posted
  • Homework 2 solution posted
  • Mock exam posted
  • Prob models slides posted
  • Rachid's slides posted
  • Homework 2 posted
  • Slides posted (Mol Biology Tools, and Indexing/Searching)
  • Homework 1 posted
  • Happy New Year!

    Overview

    An impressive wealth of data has been amassed by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course focuses on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

    Class Meeting

    TR, 12:40 p.m. - 2:00 p.m., INTS 2132

    Office hours

    Open door policy or by appointment (email me)

    Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • Space-efficient data structures for sequences
  • Short read mapping (suffix trees, suffix arrays, BWT)
  • Sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

    Prerequisites

    CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

    Course Format

    The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and background of the students enrolled.

    Relation to Other Courses

    This course is intended to complement "CS238: Algorithms in Computational Molecular Biology", and "CS235: Data Mining Concepts".

    References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens and Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
  • Marketa Zvelebil and Jeremy O. Baum, Understanding Bioinformatics, Garland Science, 2007.

    References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
  • Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
  • Avak Kahvejian, John Quackenbush, John F Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
  • Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]

    Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology)
  • Slides [PDF Format 2slides/page] (Mol Biology Tools)
  • Rachid's slides
  • Slides [PDF Format 2slides/page] (Indexing and Searching)
  • Slides [PDF Format 2slides/page] (Probability Models and Inference)

    Resources

  • RNAi animation (Nature Genetics)
  • The inner life of a Cell
  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Experimental Genome Science (on-line course)
  • Current Topics in Genome Analysis 2014 (on-line course)
  • Fundamentals of Biology (on-line course)
  • Pevzner's bioinformatics courses (on-line)

    Projects

  • Project ideas and rules
  • Anh's CS 234 webpage
  • Waleed's CS 234 webpage
  • Tiantian's CS 234 webpage
  • Reeta's CS 234 webpage
  • Frank's CS 234 webpage
  • Dipan's CS 234 webpage
  • Priyanka's CS 234 webpage
  • Nikhil's CS 234 webpage
  • Qihua's CS 234 webpage

    Homework

  • Homework 1 (posted Jan 15, due Jan 28)
  • Homework 2 (posted Jan 28, due Feb 16)
  • Homework 2 solution
  • Homework 3 (posted Feb 18, due Mar 3)

    Midterm

  • Mock midterm exam (posted Feb 16)

    Presentation

  • Choose a paper from the proceedings of RECOMB 2015 or ISMB 2015, then send me its title and the slot number (1-12) in which you would like to present (see the calendar below).
  • Send me the PowerPoint file the day before your presentation (before 5 p.m.).
  • Give a 16-minute presentation. Time it carefully: I will stop you at 16 minutes.

    Calendar of Lectures

  • Jan  5: Intro, Molecular Biology (1-21)
  • Jan  7: Molecular Biology (22-42)
  • Jan 12: Molecular Biology (43-59)
  • Jan 14: Molecular Biology (60-end) [hw1 posted]
  • Jan 19: Tools for Molecular Biology
  • Jan 21: Tools for Molecular Biology
  • Jan 26: Tools for Molecular Biology
  • Jan 28: Indexing/Searching [hw1 due][hw2 posted]
  • Feb  3: Indexing/Searching (Guest Lecture on Hash Tables and Applications to Metagenomics, R. Ounit)
  • Feb  5: Indexing/Searching
  • Feb  9: Indexing/Searching
  • Feb 11: Indexing/Searching
  • Feb 16: Indexing/Searching, Probability models [hw2 due] [hw3 posted]
  • Feb 18: Probability models
  • Feb 23: Probability models
  • Feb 25: Probability models, Biological Networks
  • Mar  1: MIDTERM (80 minutes, in class, closed books, closed notes) [hw3 due]
  • Mar  3: Presentations (deadline for the PPT file is Mar 2nd, 5PM)
    1: Stefano Biological Networks
    2: Stefano Biological Networks
    3: Dipan Protein (multi-)location prediction: utilizing interdependencies via a generative model (ISMB '15)
    4: Qihua Comparing genomes with rearrangements and segmental duplications (ISMB'15)
  • Mar  8: Presentations (deadline for the PPT file is Mar 7th, 5PM)
    5: Reeta MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence (ISMB'15)
    6: Waleed FERAL: network-based classifier with application to breast cancer outcome prediction (ISMB'15)
    7: Tiantian Starcode: sequence clustering based on all-pairs search (ISMB'15)
    8: Nikhil Gap Filling as Exact Path Length Problem (RECOMB 2015)
  • Mar 10: Presentations (deadline for the PPT file is Mar 9th, 5PM)
    10: Frank A hierarchical Bayesian model for flexible module discovery in three-way time-series data (ISMB'15)
    11: Anh CRISPR Detection from Short Reads Using Partial Overlap Graphs (RECOMB'15)
    12: Priyanka Robust reconstruction of gene expression profiles from reporter gene data using linear inversion (ISMB '15)

    Project Demo (20-25 minute demo, 5-10 minutes of questions, in my office; please bring your laptop)

  • Monday, March 14th
              10:00 Frank
              10:30 Anh
  • Tuesday, March 15th
              1:30 Qihua
              2:00 Priyanka
              2:30 Nikhil
              3:00 Dipan
  • Wednesday, March 16th
              9:30 Reeta
              10:00 Waleed
              10:30 Tiantian