CS 234, Winter 2012, Computational Methods for the Analysis of Biomolecular Data

An impressive wealth of data has been amassed by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

12:40 p.m. - 02:00 p.m. INTS 2134

Office hours

Open door policy or by appointment (email me)

Preliminary list of topics

overview of probability and statistics

intro to molecular and computational biology

analysis of 1D sequence data (DNA, RNA, proteins)

Space-efficient data structures for sequences

Short read mapping (suffix trees, suffix arrays, BWT)

Sequence alignment and hidden Markov models (HMM)

analysis of 2D data (gene expression data and graphs)

clustering algorithms

classification algorithms

subspace clustering/bi-clustering

genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some notions of probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and the background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

References (books)

Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.

Dan Gusfield, Algorithms on Strings, Trees and Sequences - Computer Science and Computational Biology, Cambridge University Press, 1997.

Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.

Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.

Neil C. Jones, Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.

References (papers)

Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]

Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, pp.71-94, 2000 [PDF format]

Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol.7, no.1/2, 2000 [PDF format]

Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]

Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]

Avak Kahvejian, John Quackenbush, John F. Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]

Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]

Slides

Slides [PDF Format 2slides/page] (Course Overview)

Slides [PDF Format 2slides/page] (Intro to Mol Biology)

Slides [PDF Format 2slides/page] (Mol Biology Tools)

Slides [PDF Format 2slides/page] (Statistics for Sequence Analysis)

Slides [PDF Format 2slides/page] (HMMs)

Slides [PDF Format 2slides/page] (Indexing and Searching)

Slides [PDF Format 2slides/page] (Bio Networks)

Resources

RNAi animation (Nature Genetics)

The inner life of a Cell

DNA Molecular animation

A bioinformatics glossary

What's a Genome (on-line book)

DNA interactive

Primer on Molecular Genetics

PMP Resources

Projects

Project ideas and rules

Yousra's CS 234 project webpage

Matt's CS 234 project webpage

Tanzirul's CS 234 webpage

Steve's CS 234 project webpage

Scott's CS 234 project webpage

Nurjahan's CS 234 project webpage

Dave's CS 234 webpage

Ron-Michael's CS 234 webpage

Sean's CS 234 webpage

Mike's CS 234 webpage

Yi-Wen's CS 234 webpage

Jie's CS 234 webpage

Homework

Homework 1 (posted Jan 17, due Jan 31)

Homework 2 (posted Feb 2, due Feb 16)

Homework 2 solution (by Y. Wu)

Homework 3 (posted Feb 21, due Mar 6)

Midterm

Midterm 2009 (posted Feb 2)

Presentation

choose a slot 1-12 below and send me your choice

choose a paper from the Proceedings of RECOMB 2011 or ISMB/ECCB 2011 and send the title to me

send the Powerpoint file to me the day before the presentation (before 5pm)

give the 18-minute presentation (make sure you time it correctly; I will stop you after 18 minutes)

Calendar of Lectures

Jan 10: Intro, Molecular Biology

Jan 12: Molecular Biology

Jan 17: Molecular Biology [hw1 posted]

Jan 19: Molecular Biology

Jan 24: Molecular Biology Tools

Jan 26: Molecular Biology Tools

Jan 31: Statistics and Probability [hw1 due]

Feb 2: Statistics and Probability [hw2 posted]

Feb 7: Statistics and Probability

Feb 9: Statistics and Probability, HMMs

Feb 14: Guest Lecture

Feb 16: Guest Lecture [hw2 due]

Feb 21: HMMs [hw3 posted]

Feb 23: HMMs, Mapping

Feb 28: Mapping

Mar 1: Mapping

Mar 6: MIDTERM (in class, closed books, closed notes) [hw3 due]

Mar 8: Presentations. (deadline for the PPT file is Mar 7, 5PM)
1: Matt (A hybrid approach to extract protein-protein interactions, Bioinformatics 2011)
2: Tanzirul (Hapsembler: An Assembler for Highly Polymorphic Genomes, RECOMB'11)
3: Ron-Michael (Predicting site-specific human selective pressure using evolutionary signatures, ISMB'11)
4: Yi-Wen (IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly)

Mar 13: Presentations. (deadline for the PPT file is Mar 12, 5PM)
5: Steve (Physical Module Networks: an integrative approach for reconstructing transcription regulation, ISMB'11)
6: Nurjahan (Learning Cellular Sorting Pathways Using Protein Interactions and Sequence Motifs, RECOMB'11)
7: Dave (Blocked Pattern Matching Problem and Its Applications in Proteomics, RECOMB'11)
8: Yousra (An enhanced Petri-net model to predict synergistic effects of pairwise drug combinations from gene microarray data, ISMB'11)

Mar 15: Presentations. (deadline for the PPT file is Mar 14, 5PM)
9: Scott (Tanglegrams for rooted phylogenetic trees and networks, ISMB'11)
10: Jie (MeSH: a window into full text for document summarization, ISMB'11)
11: Mike (Automatic 3D neuron tracing using all-paths pruning, ISMB 2011)
12: Sean (Identifying Branched Metabolic Pathways by Merging Linear Metabolic Pathways, RECOMB 2011)

Project Demo (20-25 minute demo, 5-10 minutes of questions, in my office; please bring your laptop)

Monday, March 19
          3:00 Scott

Wednesday, March 21
          10:30 Sean
          11:00 Ron-Michael
          11:30 Yi-Wen
          1:30 Tanzirul
          2:00 Yousra

Thursday, March 22
          11:00 Mike
          11:30 Jie
          1:00 Dave
          1:30 Nurjahan
          2:00 Matt
          2:30 Steve