CS 234: Computational Methods for the Analysis of Biomolecular Data


  • Homework 3 solution posted
  • Slides posted
  • Homework 2 solution posted
  • Homework 3 posted
  • New slides posted
  • Project webpages posted
  • Posted rules for the final paper (for those who would prefer it to giving a presentation via Zoom). If you prefer the report, please email me.
  • Homework 2 posted
  • New slides posted
  • Homework 1 posted
  • New slides posted
  • In case you missed a lecture, go to Yuja for the recordings
  • The first lecture is Jan 4, 2021
  • Happy New Year!

    Overview

    An impressive wealth of data has been amassed by genome, metagenome, and epigenetic projects and by other efforts to determine the structure and function of molecular biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing biomolecular data.

    Class Meeting

    MWF, 10:00am - 10:50am, Zoom Meeting ID 980 3053 6579 - Email me for the password - Go to Yuja for the recordings

    Office hours

    By appointment via Zoom (email me)

    Preliminary list of topics

  • intro to molecular and computational biology, including biotech tools
  • overview of probability and statistics
  • analysis of 1D sequence data (DNA, RNA, proteins)
  • Space-efficient data structures for sequences
  • Short read mapping (suffix trees, suffix arrays, BWT)
  • Sequence alignment and hidden Markov models (HMM)
  • analysis of 2D data (gene expression data and graphs)
  • clustering algorithms
  • classification algorithms
  • subspace clustering/bi-clustering
  • genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs

    Prerequisites

    CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.

    Course Format

    The course will include lectures by the instructor and presentations by the students. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three assignments, mostly theoretical in nature, although some may require programming.

    Relation to Other Courses

    This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".

    References (books)

  • Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
  • Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
  • Marketa Zvelebil, Jeremy O. Baum, Understanding Bioinformatics, Garland Science, 2007.

    References (papers)

  • Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
  • Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
  • Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
  • Avak Kahvejian, John Quackenbush, John F Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
  • Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]

    Slides

  • Slides [PDF format, 2 slides/page] (Course Overview)
  • Slides [PDF format, 2 slides/page] (Intro to Mol Biology)
  • Slides [PDF format, 2 slides/page] (Mol Biology Tools)
  • Slides [PDF format, 2 slides/page] (Indexing and Searching)
  • Slides [PDF format, 2 slides/page] (Probability Models and Inference)
  • Slides [PDF format, 2 slides/page] (Bio Networks)

    Resources

  • CS 234 Fold it! group
  • RNAi animation (Nature Genetics)
  • DNA Molecular animation
  • DNA interactive
  • Genomic Data Science Specialization (Coursera)
  • Bioconductor for Genomic Data Science (Coursera)
  • Genome Sequencing (Bioinformatics II) (Coursera)
  • Introduction to Genomics (NHGRI)
  • Fundamentals of Biology (on-line course)
  • Pevzner's bioinformatics courses (Coursera)

    Projects

  • Project ideas and rules
  • Create your CS 234 webpage on Google
  • Saleh Sereshki's project webpage
  • Wendy Liu's project webpage
  • Mohammad Reza Zare Shahneh's project webpage
  • Hyuna Kwon's project webpage
  • Tim Day's project webpage
  • Tonglu Dou's project webpage
  • Jinli Zhang's project webpage
  • Mariana Machado Garcez Duarte's project webpage
  • Henry Wu's project webpage
  • Sakshar Chakravarty's project webpage
  • Majid Saeed Ali Saeedan's project webpage
  • Qicheng Hu's project webpage
  • Amirsadra Mohseni's project webpage
  • Amy Boyd's project webpage
  • Daniel Tan's project webpage
  • Varun Chawla's project webpage
  • Yuanbin Cheng's project webpage
  • Yuzhe Ni's project webpage
  • Mehrnaz Ayazi's project webpage
  • Prince Choudhary's project webpage
  • Gontu Abhinav's project webpage
  • Pegah Mirabedini's project webpage
  • Shiyao Feng's project webpage
  • Arman Irani's project webpage

    Homework

  • Homework 1 (posted Jan 15, due Jan 29, midnight)
  • Homework 2 (posted Jan 29, due Feb 12, midnight)
  • Homework 2 solution
  • Homework 3 (posted Feb 15, due Mar 1, midnight)
  • Homework 3 solution

    Term paper (to replace the presentation)

  • If you prefer to write a term paper instead of giving a presentation, please read the rules and email me by Friday, Feb 5.

    Presentation

  • Choose a paper from the proceedings of RECOMB 2020, RECOMB 2019, or ISMB 2020, and use the sign-up sheet to reserve a spot
  • Email me the PowerPoint file the day before the presentation (before 5pm)
  • Give the 15-minute presentation via Zoom. Time it carefully: I will stop you at 15 minutes, and we will reserve at least 2 minutes for questions

    Calendar of Lectures

    Week 1
  • Jan  4: Intro, Molecular Biology (1-9)
  • Jan  6: Molecular Biology (10-28)
  • Jan  8: Molecular Biology (29-47)
    Week 2
  • Jan 11: Molecular Biology (48-62)
  • Jan 13: Molecular Biology (63-87)
  • Jan 15: Molecular Biology (88-end), Molecular Biology Tools (1-5) [hw1 posted]
    Week 3
  • Jan 18: Campus holiday
  • Jan 20: Molecular Biology Tools (6-29)
  • Jan 22: Molecular Biology Tools (30-42)
    Week 4
  • Jan 25: Molecular Biology Tools (43-end)
  • Jan 27: Indexing (1-23)
  • Jan 29: Indexing (24-48) [hw1 due] [hw2 posted]
    Week 5
  • Feb  1: Indexing (49-66)
  • Feb  3: Indexing (67-83)
  • Feb  5: Indexing (84-99)
    Week 6
  • Feb  8: Indexing (100-end), Probability Models (1-)
  • Feb 10: Probability Models (-)
  • Feb 12: Probability Models (-) [hw2 due]
    Week 7
  • Feb 15: Campus holiday [hw3 posted]
  • Feb 17: Probability Models (-)
  • Feb 19: Probability Models (-end)
    Week 8
  • Feb 22: Networks (1-)
  • Feb 24: Networks (-)
  • Feb 26: Networks (-end)
    Week 9
  • Mar  1: [hw3 due] Presentations (deadline for the PPT file is Feb 28th, 5PM)
  • Mar  3: Presentations (deadline for the PPT file is Mar 2nd, 5PM)
  • Mar  5: Presentations (deadline for the PPT file is Mar 4th, 5PM)
    Week 10
  • Mar  8: Presentations (deadline for the PPT file is Mar 7th, 5PM)
  • Mar 10: Presentations (deadline for the PPT file is Mar 9th, 5PM)
  • Mar 12: Presentations (deadline for the PPT file is Mar 11th, 5PM)
  • Project Demo (20-25 minute demo, 5-10 minutes for questions, via Zoom): sign-up sheet. Please use the same Zoom link that we use for the lectures.