CS 234: Computational Methods for the Analysis of Biomolecular Data


  • (Feb 25) Homework 2 solution posted
  • (Feb 19) Bio networks slides posted
  • (Feb 14) Homework 3 posted
  • (Feb 12) Mock exam posted.
  • (Feb 6) Probability slides posted
  • (Feb 1) Homework 2 posted
  • (Jan 29) Indexing/searching slides posted
  • (Jan 29) Homework 1 updated
  • (Jan 21) Mol biology tools slides posted
  • (Jan 17) Homework 1 posted
  • Happy New Year!

    Overview

    An impressive wealth of data has been amassed by genome, metagenome, and epigenetic projects and by other efforts to determine the structure and function of molecular biological systems. This advanced graduate course focuses on a selection of computational problems aimed at automatically analyzing biomolecular data.

    Class Meeting

    TR, 11:10 AM - 12:30 PM, WCH 142

    Office hours

    Open door policy or by appointment (email me)

    Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • Space-efficient data structures for sequences
  • Short read mapping (suffix trees, suffix arrays, BWT)
  • Sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

    Prerequisites

    CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

    Course Format

    The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and background of the students enrolled.

    Relation to Other Courses

    This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

    References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences - Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
  • Marketa Zvelebil, Jeremy O. Baum, Understanding Bioinformatics, Garland Science, 2007.

    References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
  • Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
  • Avak Kahvejian, John Quackenbush, John F Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
  • Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]

    Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology)
  • Slides [PDF Format 2slides/page] (Mol Biology Tools)
  • Slides [PDF Format 2slides/page] (Indexing and Searching)
  • Slides [PDF Format 2slides/page] (Probability Models and Inference)
  • Slides [PDF Format 2slides/page] (Bio Networks)

    Resources

  • CS 234 Fold it! group
  • RNAi animation (Nature Genetics)
  • The inner life of a Cell
  • DNA Molecular animation
  • DNA interactive
  • Experimental Genome Science (on-line course)
  • Current Topics in Genome Analysis 2014 (on-line course)
  • Fundamentals of Biology (on-line course)
  • Pevzner's bioinformatics courses (on-line)

    Projects

  • Project ideas and rules
  • Create your CS 234 webpage on Google
  • Xinru Qiu's project
  • Haoping Wang's project
  • Sadaf Tafazoli's project
  • MD. Omar Faruk Rokon's project
  • Risul Islam's project
  • Parker Newton's project
  • Sayanton Vhaduri Dibbo's project
  • Tung Dinh's project
  • Li Guo's project
  • Nathanael Roy's project
  • Farzin Houshmand's project
  • Mayur Patil's project
  • Tariq Shams's project
  • Isaac Quintanilla Salinas's project

    Homework

  • Homework 1 (posted Jan 17, updated Jan 29, due Jan 31)
  • Homework 2 (posted Jan 31, due Feb 14)
  • Homework 2 solution
  • Homework 3 (posted Feb 14, due Feb 28)

    Midterm

  • Mock midterm exam (posted Feb 12)

    Presentation

  • Choose a paper from the proceedings of RECOMB 2018, RECOMB 2017, or ISMB 2018, and email me the title and the slot number (1-16) in which you want to present; see below for availability
  • Email me the PowerPoint file the day before the presentation (before 5 PM)
  • Give a 25-minute presentation (time it carefully; I will stop you at 25 minutes)

    Calendar of Lectures

    Week 1
  • Jan  8: Intro, Molecular Biology (1-11)
  • Jan 10: Molecular Biology (12-46)
    Week 2
  • Jan 15: Molecular Biology (47-74)
  • Jan 17: Molecular Biology (75-end), Mol Biology Tools (1-8)[hw1 posted]
    Week 3
  • Jan 22: Mol Biology Tools (9-31)
  • Jan 24: Mol Biology Tools (32-end)
    Week 4
  • Jan 29: Indexing/Searching (1-40)
  • Jan 31: Indexing/Searching (41-78)[hw1 due][hw2 posted]
    Week 5
  • Feb  5: Indexing/Searching (79-112)
  • Feb  7: Indexing/Searching (113-end), Probability (1-24)
    Week 6
  • Feb 12: Probability (24-66)
  • Feb 14: Probability (67-97) [hw2 due][hw3 posted]
    Week 7
  • Feb 19: Probability (98-end), Networks (1-38)
  • Feb 21: Networks (39-70)
    Week 8
  • Feb 26: Networks (71-end)
  • Feb 28: MIDTERM (80 minutes, in class, closed books, closed notes) [hw3 due]
    Week 9
  • Mar  5: Presentations (deadline for the PPT file is Mar 4th, 5PM)
    1: Tarique Shams: Bayesian networks for mass spectrometric metabolite identification via molecular fingerprints
    2: Tung Dinh: Chromatyping: Reconstructing Nucleosome Profiles from NOMe Sequencing Data
  • Mar  7: Presentations (deadline for the PPT file is Mar 6th, 5PM)
    4: Li Guo: Longitudinal Genotype-Phenotype Association Study via Temporal Structure Auto-learning Predictive Model
    5: Shima Imani: Circular Networks from Distorted Metrics
    6: Md Omar Faruk Rokon: GTED: Graph Traversal Edit Distance
    Week 10
  • Mar 12: Presentations (deadline for the PPT file is Mar 11th, 5PM)
    7: Xinru Qiu: Minimap2: pairwise alignment for nucleotide sequences, ISMB 2018
    8: Isaac Quintanilla Salinas: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment
    9: Mayur Patil: Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index, RECOMB 2018
  • Mar 14: Presentations (deadline for the PPT file is Mar 13th, 5PM)
    10: Parker Newton: AnoniMME: bringing anonymity to the matchmaker exchange platform for rare disease gene discovery
    11: Sadaf Tafazoli: Long Reads Enable Accurate Estimates of Complexity of Metagenomes
    12: Nathanael Roy: Superbubbles, Ultrabubbles and Cacti
  • Project Demo (20-25 minute demo, 5-10 minutes for questions, in my office; please bring your laptop): sign up