CS 234, Winter 2013: Computational Methods for the Analysis of Biomolecular Data

An impressive wealth of data has been amassed by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.

Class Meeting

TR, 2:10 p.m. - 3:30 p.m. INTS 2134

Office hours

Open door policy or by appointment (email me)

Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • space-efficient data structures for sequences
  • short read mapping (suffix trees, suffix arrays, BWT)
  • sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

Prerequisites

CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

Course Format

The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and background of the students enrolled.

Relation to Other Courses

This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones, Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.

References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
  • Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
  • Avak Kahvejian, John Quackenbush, John F. Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
  • Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]

Slides

  • Slides [PDF format, 2 slides/page] (Course Overview)
  • Slides [PDF format, 2 slides/page] (Intro to Mol Biology)
  • Slides [PDF format, 2 slides/page] (Mol Biology Tools)
  • Slides [PDF format, 2 slides/page] (Indexing and Searching)
  • Slides [PDF format, 2 slides/page] (Probability Models and Inference)
  • Slides [PDF format, 2 slides/page] (Bio Networks)

Resources

  • RNAi animation (Nature Genetics)
  • The inner life of a Cell
  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Experimental Genome Science (on-line course)
  • Current Topics in Genome Analysis 2012 (on-line course)
  • Fundamentals of Biology (on-line course)

Projects

  • Project ideas and rules
  • Pavan's CS 234 project webpage
  • Xin's CS 234 project webpage
  • Bryan's CS 234 project webpage
  • Keval's CS 234 project webpage
  • Rachid's CS 234 project webpage
  • Feroz's CS 234 project webpage
  • Sara's CS 234 project webpage
  • Kenneth's CS 234 project webpage
  • Gurneet's CS 234 project webpage
  • Matt's CS 234 project webpage
  • Farzad's CS 234 project webpage
  • Zhigang's CS 234 project webpage
  • James' CS 234 project webpage
  • Panruo's CS 234 project webpage
  • Nicholas' CS 234 project webpage
  • John's CS 234 project webpage
  • Hind's CS 234 project webpage
  • Yanping's CS 234 project webpage

Homework

  • Homework 1 (posted Jan 15, due Jan 29)
  • Homework 2 (posted Jan 31, due Feb 19)
  • Homework 2 solution
  • Homework 3 (posted Feb 19, due Mar 5)

Midterm

  • Mock midterm exam (posted Feb 26)

Presentation

  • choose a paper from the proceedings of RECOMB 2012 or ISMB 2012, and send me its title and the slot (1-18) in which you want to present (see the calendar below)
  • send me the PowerPoint file the day before your presentation (before 5pm)
  • give an 18-minute presentation (make sure you time it carefully; I will stop you after 18 minutes)

Calendar of Lectures

  • Jan 8: Intro, Molecular Biology (1-15)
  • Jan 10: Molecular Biology (16-)
  • Jan 15: Molecular Biology [hw1 posted]
  • Jan 17: Molecular Biology
  • Jan 22: Molecular Biology
  • Jan 24: Molecular Biology Tools
  • Jan 29: Molecular Biology Tools [hw1 due]
  • Jan 31: Indexing [hw2 posted]
  • Feb  5: Indexing
  • Feb  7: Indexing
  • Feb 12: Indexing
  • Feb 14: Guest lecture
  • Feb 19: Probability and Statistics [hw2 due] [hw3 posted]
  • Feb 21: Probability and Statistics
  • Feb 26: Probability and Statistics
  • Feb 28: Biological Networks
  • Mar  5: MIDTERM (in class, closed book, closed notes) [hw3 due]
  • Mar  7: Presentations (deadline for the PPT file is Mar 6th, 5PM)
    1: John (Recovering the Tree-Like Trend of Evolution Despite Extensive Lateral Genetic Transfer: A Probabilistic Analysis)
    2: Bryan (ChemSpot: a hybrid system for chemical named entity recognition)
    3: James (MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins)
    4: Zhigang (Tachyon search speeds up retrieval of similar sequences by several orders of magnitude)
  • Mar 12: Presentations (deadline for the PPT file is Mar 11th, 5PM)
    5: Gurneet (Matching experiments across species using expression values and textual information)
    6: Keval (A single source k-shortest paths algorithm to infer regulatory pathways in a gene network)
    7: Yanping (Ballast: A Ball-Based Algorithm for Structural Motifs)
    8: Hind (GenomeRing: alignment visualization based on SuperGenome coordinates)
  • Mar 14: Presentations (deadline for the PPT file is Mar 13th, 5PM)
    9: Sara (iDELISHUS: an efficient and exact algorithm for genome-wide detection of deletion polymorphism in autism)
    10: Rachid (Protein structure by semidefinite facial reduction)
    11: Feroz (Simultaneously Learning DNA Motif along with Its Position and Sequence Rank Preferences through EM Algorithm)
    12: Nicholas (MotEvo: integrated Bayesian probabilistic methods for inferring regulatory sites and motifs on multiple alignments of DNA sequences)
  • Mar 19: Presentations (in class 11:30am-1:30pm, deadline for the PPT file is Mar 18th, 5PM)
    13: Farzad (Using GPUs for the Exact Alignment of Short-Read Genetic Sequences by Means of the Burrows-Wheeler Transform)
    14: Panruo (SEQuel: improving the accuracy of genome assemblies)
    15: Matt (Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly)
    16: Pavan (CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform)
    17: Xin (Minimum message length inference of secondary structure from protein coordinate data)
    18: Kenny (TrueSight: Self-training Algorithm for Splice Junction Detection Using RNA-seq)

Project Demo (20-25 minute demo, 5-10 minutes of questions, in my office; please bring your laptop)

  • Friday, March 15
    10:30 Zhigang
    11:30 John
  • Monday, March 18
    10:00 Keval
    10:30 Rachid
    11:00 Xin
    11:30 Kenny
    12:00 Nicholas
     4:00 Panruo
     4:30 Pavan
     5:00 Sara
  • Tuesday, March 19
    10:00 Farzad
    10:30 James
     2:00 Matt
     2:30 Gurneet
     3:00 Bryan
     3:30 Feroz
     4:00 Yanping
     4:30 Hind