CS 144: Algorithms for Bioinformatics

Spring 2021

Overview

An unprecedented wealth of data is being generated by large genome, metagenome, and epigenetic projects and by other efforts to determine the structure and function of molecular biological systems. This technical elective focuses on a selection of algorithms and data structures for the analysis of biomolecular data. In other words, CS 144 is a data science class oriented toward biomolecular data.

Catalog Description

  • Introduces fundamental algorithms and data structures for solving analytical problems in molecular biology and genomics. Includes exact and approximate string matching; sequence alignment; genome assembly; and recognition of genes and regulatory motifs.
  • Note: Credit is awarded for only one of the following: CS 144, CS 234, or CS 238.

Prerequisites

  • CS 141
  • Solid programming experience (ideally Python)
  • Some basic notions of probability and statistics
  • No biology background is assumed

Instructors

  • Stefano Lonardi, email, office MRB 3130
  • Saleh Sereshki, email, office MRB

Class Meeting

  • TR 9:30am-10:50am
  • Zoom ID: 918 6420 0042; the password will be emailed to you (or log in to Canvas and check the announcements)
  • Head to Yuja for the recordings

Discussions

  • W 10am-10:50am
  • Zoom ID: 952 6605 3109, same password as the lecture
  • Head to Yuja for the recordings

Office hours

  • Stefano: Thursday 11am-12pm via Zoom ID 918 6420 0042, same password as the lecture
  • Saleh: Wednesday 11am-12pm via Zoom ID 952 6605 3109, same password as the lecture

Discussion Forum

We will use a Discord server for discussion and questions about CS 144 (and beyond, religion and politics excluded). The forum will be moderated by the instructor and the TA, who will respond to questions, but students are encouraged to help each other. Assignment specifics, however, should not be discussed. You will receive an invite to the Discord server via email; if you joined the class late, please check Canvas. Please be respectful.

Required Textbook

  • Bioinformatics Algorithms (UC Riverside edition) by Phillip Compeau (CMU) and Pavel Pevzner (UCSD), 2020. Companion website

Preliminary list of topics

  • Intro to molecular and computational biology, including biotech tools (slides)
  • Space-efficient data structures for sequences
  • Short read mapping (suffix tries/trees, suffix arrays, Burrows-Wheeler transform)
  • Sequence alignment (global, local, linear-space, multiple)
  • Genome assembly, overlap graphs, de Bruijn graphs
  • Hidden Markov models, Profile HMM, Viterbi and Baum-Welch learning
  • Motif finding and Gibbs sampling
  • Construction of evolutionary trees (phylogeny)

Course Format

  • Seven or eight individual homework assignments, to be completed in JupyterLab (50% of the grade)
  • One programming project (details TBA) (50% of the grade)

Cheating

We will not tolerate any kind of cheating in this course. Homework and the final project are to be completed on your own. The only external sources allowed are those mentioned above or by the instructor during the course. If you have a doubt or question, please just ASK. As per standard UCR policy, you may not submit answers (written or programming) to problem sets that contain material you did not produce yourself for the express purpose of this offering of this course. If I find that you have submitted work that is not your own, or work that you submitted in a different course, I will assign you a zero on that assignment (and possibly a zero for the entire course, depending on the severity), and I will forward the case to Student Conduct and Academic Integrity Programs for campus-level consideration.

Late work

Each student is granted five "late days", which can be used (in whole-day units) on any of the homework assignments. If a more dire situation arises, please contact the instructor.

Slides

Slides will be posted on Canvas.

Grades

Grades will be posted on Canvas.

Homework

Homework (in the form of Python notebooks) will be released on Canvas on Sundays (go to Assignments) and will be due the following Sunday at 11:59pm. Download the notebooks to your computer, then upload them into JupyterLab. Homework must be completed on the CS department's JupyterHub server at https://locus.cs.ucr.edu/. Submit your Python notebook on Canvas by the due date. Solutions will be posted on Canvas.

Calendar

Week 1
  • Tuesday, Mar 30: Intro, Molecular Biology
  • Thursday, Apr 1: Molecular Biology
  • Sunday, Apr 4: [hw1 posted]

Week 2
  • Tuesday, Apr 6: Molecular Biology
  • Thursday, Apr 8: Read Mapping
  • Sunday, Apr 11: [hw1 due], [hw2 posted]

Week 3
  • Tuesday, Apr 13: Read Mapping
  • Thursday, Apr 15: Read Mapping
  • Sunday, Apr 18: [hw2 due], [hw3 posted]

Week 4
  • Tuesday, Apr 20: Read Mapping, Sequence Alignment
  • Thursday, Apr 22: Projects
  • Sunday, Apr 25: [hw3 due], [hw4 posted]

Week 5
  • Tuesday, Apr 27: Sequence Alignment
  • Thursday, Apr 29: Sequence Alignment, Genome Assembly
  • Sunday, May 2: [hw4 due]

Week 6
  • Tuesday, May 4: Genome Assembly
  • Thursday, May 6: Genome Assembly, HMM
  • Sunday, May 9: [hw5 posted]

Week 7
  • Tuesday, May 11: HMM
  • Thursday, May 13: HMM
  • Sunday, May 16: [hw5 due], [hw6 posted]

Week 8
  • Tuesday, May 18: HMM, Motif finding
  • Thursday, May 20: Motif finding
  • Sunday, May 23: [hw6 due], [hw7 posted]

Week 9
  • Tuesday, May 25: Motif finding, Evolutionary trees
  • Thursday, May 27: Evolutionary trees
  • Sunday, May 30: [hw7 due]

Week 10
  • Tuesday, Jun 1: Evolutionary trees, Concluding remarks
  • Thursday, Jun 3: CANCELLED

Finals' Week
  • Project demo

Additional References

  • (HMMs) Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
  • (Suffix Trees) Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
  • (Algorithms) Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
  • (Algorithms) Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
  • (Algorithms) Marketa Zvelebil and Jeremy O. Baum, Understanding Bioinformatics, Garland Science, 2007.

Additional resources

  • Learn how to Fold it! A great game about protein folding that can help the scientific community
  • RNAi animation (Nature Genetics)
  • DNA Molecular animation
  • DNA interactive
  • Genomic Data Science Specialization (Coursera)
  • Bioconductor for Genomic Data Science (Coursera)
  • Genome Sequencing (Bioinformatics II) (Coursera)
  • Introduction to Genomics (NHGRI)
  • Fundamentals of Biology (on-line course)
  • Pevzner's bioinformatics courses (Coursera)