CS 234, W17: Computational Methods for the Analysis of Biomolecular Data


  • hw3 solution posted
  • Slides posted (probability and statistics)
  • Mock exam posted
  • Slides posted (indexing and searching)
  • Slides posted (mol biology tools)
  • Projects posted
  • Homework 1 posted
  • Slides posted (molecular biology)
  • Happy New Year!

    Overview

    An impressive wealth of data has been amassed by genome/metagenome sequencing projects and other efforts to determine the structure and function of molecular biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing biomolecular data.

    Class Meeting

    MWF, 10:10 - 11:00 AM, MSE 103

    Office hours

    Open door policy or by appointment (email me)

    Preliminary list of topics

  • overview of probability and statistics
  • intro to molecular and computational biology
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • Space-efficient data structures for sequences
  • Short read mapping (suffix trees, suffix arrays, BWT)
  • Sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

    Prerequisites

    CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

    Course Format

    The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and background of the students enrolled.

    Relation to Other Courses

    This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

    References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees and Sequences - Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
  • Marketa Zvelebil, Jeremy O. Baum, Understanding Bioinformatics, Garland Science, 2007.

    References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
  • Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
  • Avak Kahvejian, John Quackenbush, John F Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
  • Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]

    Slides

  • Slides [PDF Format 2slides/page] (Course Overview)
  • Slides [PDF Format 2slides/page] (Intro to Mol Biology)
  • Slides [PDF Format 2slides/page] (Mol Biology Tools)
  • Slides [PDF Format 2slides/page] (Indexing and Searching)
  • Slides [PDF Format 2slides/page] (Probability Models and Inference)

    Resources

  • RNAi animation (Nature Genetics)
  • The inner life of a Cell
  • DNA Molecular animation
  • A bioinformatics glossary
  • What's a Genome (on-line book)
  • DNA interactive
  • Experimental Genome Science (on-line course)
  • Current Topics in Genome Analysis 2014 (on-line course)
  • Fundamentals of Biology (on-line course)
  • Pevzner's bioinformatics courses (on-line)

    Projects

  • Project ideas and rules
  • Amir's CS 234 webpage
  • Ekta Gujral's CS 234 webpage
  • Elaine Leung's CS 234 webpage
  • Hao Chen's CS 234 webpage
  • Tsung-Ying's CS 234 webpage
  • Sharmistha Bardhan's CS 234 webpage
  • Dipankar ranjan Baisya's CS 234 webpage
  • Xiu Zhang's CS 234 webpage
  • Po-Hung Lu's CS 234 webpage
  • Yawei Li's CS 234 webpage
  • Raghavendra Dinesh's CS 234 webpage
  • Abhignana Mihir's CS 234 webpage
  • Chengkuan Hong's CS 234 webpage
  • Yangyang Hu's CS 234 webpage
  • Lufei Xie's CS 234 webpage
  • Shravani Madhavaram's CS 234 webpage
  • Lalitha Dwarapudi's CS 234 webpage
  • Ravdeep Pasricha's CS 234 webpage
  • Nan Xiong's CS 234 webpage
  • Shipra Jais's CS 234 webpage
  • Nathan Robertson's CS 234 webpage

    Homework

  • Homework 1 (posted Jan 20, due Feb 3)
  • Homework 2 (posted Feb 3, due Feb 17)
  • Homework 2 solution
  • Homework 3 (posted Feb 17, due Mar 3)
  • Homework 3 solution

    Midterm

  • Mock midterm exam (posted Feb 3)

    Reports

  • Select a paper from the list provided and write a four-page report (11-point font, 1-inch margins, single-spaced), excluding the bibliography and figures/tables (which can go at the end). Warning: copying and pasting text from the original paper into your report is considered cheating.

    Presentation

  • Choose a paper from the proceedings of RECOMB 2016 or ISMB 2016 and email me the title and the slot number (1-14) in which you want to present; see below for availability
  • Email me the PowerPoint file the day before the presentation (before 5 PM)
  • Give a 14-15 minute presentation (time it carefully; I will stop you at 15 minutes)

    Calendar of Lectures

  • Jan  9: Intro, Molecular Biology (1-11)
  • Jan 11: Molecular Biology (12-28)
  • Jan 13: Molecular Biology (29-43)
  • Jan 16: Holiday
  • Jan 18: Molecular Biology (44-59)
  • Jan 20: Molecular Biology (60-76) [hw1 posted]
  • Jan 23: Molecular Biology (77-end), Molecular Biology Tools (1-8)
  • Jan 25: Molecular Biology Tools (8-27)
  • Jan 27: Molecular Biology Tools (28-end)
  • Jan 30: Indexing and Searching (1-10)
  • Feb  1: Indexing and Searching (11-30)
  • Feb  3: Indexing and Searching (31-68) [hw1 due][hw2 posted]
  • Feb  6: Indexing and Searching (68-86)
  • Feb  8: Indexing and Searching (86-102)
  • Feb 10: Indexing and Searching (87-103)
  • Feb 13: Indexing and Searching (104-end)
  • Feb 15: Probability models (1-24)
  • Feb 17: Probability models (25-42) [hw2 due][hw3 posted]
  • Feb 20: Holiday
  • Feb 22: Probability models
  • Feb 24: Probability models
  • Feb 27: Probability models
  • Mar  1: Probability models
  • Mar  3: Probability models [hw3 due]
  • Mar  6: MIDTERM (50 minutes, in class, closed books, closed notes)
  • Mar  8: Presentations (deadline for the PPT file is Mar 7th, 5PM)
    1: Shipra Beyond accuracy: creating interoperable and scalable text-mining web services
    2: Hao Convolutional neural network architectures for predicting DNA-protein binding, ISMB 2016
    3: Ekta Gene expression inference with deep learning
  • Mar 10: Presentations (deadline for the PPT file is Mar 9th, 5PM)
    4: Xiu Classifying and segmenting microscopy images with deep multiple instance learning
    5: Dipankar Predicting Enhancer-Promoter Interaction from Genomic Sequence with Deep Neural Networks
    6: Yangyang New Genome Similarity Measures Based on Conserved Gene Adjacencies (RECOMB)
  • Mar 13: Presentations (deadline for the PPT file is Mar 12th, 5PM)
    7: Elaine Improving Bloom filter performance on sequence data using k-mer Bloom filter
    8: Nathan A cross-species bi-clustering approach to identifying conserved co-regulated genes
    9: Yawei deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding
  • Mar 15: Presentations (deadline for the PPT file is Mar 14th, 5PM)
    10: Amir Privacy-preserving microbiome analysis using secure computation
    11: Chengkuan Analysis of differential splicing suggests different modes of short-term splicing regulation
  • Mar 17: Presentations (deadline for the PPT file is Mar 16th, 5PM)
    12: Lufei What time is it? Deep learning approaches for circadian rhythms
    13: Ravdeep Multitask Matrix Completion for Learning Protein Interactions Across Diseases
  • Report Due (Sharmistha Bardhan, Nan Xiong, Abhignana Mihir Kandepu, Lalitha Dwarapudi, Shravani Madhavaram, Tsung-Ying Chen, Raghavendra Dinesh Pasupuleti, Po-Hung Lu)

    Project Demo (20-25 minute demo, 5-10 minutes of questions, in my office; please bring your laptop)

  • Monday, March 20th
              9:00: Yangyang Hu
              9:30: Ekta Gujral
              10:00: Amir Nodehi Sabet
              10:30: Raghavendra Dinesh Pasupuleti
              11:00: Lufei Xie
              11:30: Shipra Jais
              2:00: Shravani Madhavaram
              2:30: Hao Chen
              3:00: Abhignana Kandepu
              3:30: Dipankar ranjan Baisya
              4:00: Sharmistha Bardhan
  • Tuesday, March 21st
              3:00: Nathan Robertson
              3:30: Tsung-Ying Chen
              4:00: Lalitha Dwarapudi
              4:30: Nan Xiong
  • Wednesday, March 22nd
              9:00: Chengkuan Hong
              9:30: Elaine Leung
              10:00: Xiu Zhang
              10:30: Yawei Li
              11:00: Po-Hung Lu
              11:30: Ravdeep Pasricha