CS 167 - Introduction to Big Data

Time: Tuesday & Thursday - 2:00 PM to 3:20 PM

Location: Zoom. Check iLearn for Zoom link

Instructor: Ahmed Eldawy
Office Hours: Monday & Thursday 11:00 - 11:50 AM (Zoom link on iLearn)

TA: Payas Rajan
Office Hours: Monday and Wednesday 2:00-3:00 PM (Zoom link on iLearn)

TA (MSOL): Xin Zhang
Office Hours: Monday and Tuesday 6:00 - 7:30 PM (Zoom link on iLearn)

Syllabus

Textbook: Learning Spark: Lightning-Fast Data Analytics (2nd Edition) by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee.

CS 167 covers the data management and systems aspects of big-data platforms such as Hadoop, Spark, and AsterixDB. In this course, you will learn how data is stored in a distributed file system and how queries run in parallel. The course covers the following topics:

  • An overview of big data management systems
  • Distributed big-data storage
  • Programming models in big data (e.g., MapReduce and RDD)
  • Column-based storage and analytics on big data
  • Big spatial data
  • Document databases
  • Machine learning on big data
  • Big-data visualization

Grade Breakdown

  • (10%) Active class participation (quizzes and activities)
  • (15%) Assignments
  • (30%) Labs
  • (15%) Mid-term 1
  • (15%) Mid-term 2
  • (15%) Mid-term 3

Grading Scheme

Grade  Points
A+     [97,100]
A      [92,97[
A-     [90,92[
B+     [87,90[
B      [83,87[
B-     [80,83[
C+     [77,80[
C      [73,77[
C-     [70,73[
D+     [67,70[
D      [63,67[
D-     [50,63[
F      [0,50[
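
As a rough illustration of how the grade breakdown and the grading scheme fit together, the sketch below computes a weighted course score and maps it to a letter grade. It assumes that every component is scored on a 0-100 scale and that the course score is the plain weighted sum of the components; the syllabus does not state either, nor any rounding rule. The names `WEIGHTS`, `weighted_score`, and `letter_grade` are hypothetical and chosen only for this example.

```python
from bisect import bisect_right

# Weights (in percent) copied from the Grade Breakdown list.
WEIGHTS = {
    "participation": 10,
    "assignments": 15,
    "labs": 30,
    "midterm1": 15,
    "midterm2": 15,
    "midterm3": 15,
}

# Lower bound of each interval in the Grading Scheme, in ascending order.
# Every interval is half-open, [low, high[, except A+, which includes 100.
LOWER_BOUNDS = [0, 50, 63, 67, 70, 73, 77, 80, 83, 87, 90, 92, 97]
LETTERS = ["F", "D-", "D", "D+", "C-", "C", "C+",
           "B-", "B", "B+", "A-", "A", "A+"]


def weighted_score(scores):
    """Weighted sum of component scores (assumption: each is on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(points):
    """Map a course score in [0, 100] to the letter whose interval contains it."""
    if not 0 <= points <= 100:
        raise ValueError("points must be between 0 and 100")
    # bisect_right counts the lower bounds that are <= points, so a score
    # sitting exactly on a boundary falls into the higher interval.
    return LETTERS[bisect_right(LOWER_BOUNDS, points) - 1]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = {
        "participation": 100,
        "assignments": 90,
        "labs": 95,
        "midterm1": 80,
        "midterm2": 85,
        "midterm3": 90,
    }
    total = weighted_score(example)
    print(total, letter_grade(total))
```

Because the intervals are closed on the left, a score of exactly 92 is an A, while 91.99 is still an A-.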

Schedule

Date           Topic                                           Reading            Material
Tuesday 3/30   Introduction to Big Data                                           Slides
Thursday 4/1   A tour of big-data systems                                         Slides
Tuesday 4/6    Hadoop Distributed File System (HDFS)           HDFS Architecture  Slides
Thursday 4/8   Hadoop Distributed File System (HDFS)                              Class Activity
Tuesday 4/13   Big-data Processing                                                Slides
Thursday 4/15  MapReduce Computation
Tuesday 4/20   Resilient Distributed Datasets (RDD)
Thursday 4/22  Mid-term 1
Tuesday 4/27   Resilient Distributed Datasets (RDD)
Thursday 4/29  Resilient Distributed Datasets (RDD)
Tuesday 5/4    Structured and Semi-structured Data Processing
Thursday 5/6   Storage Formats
Tuesday 5/11   Parquet File Format
Thursday 5/13  Mid-term 2
Tuesday 5/18   Big Spatial Data Processing
Thursday 5/20  NoSQL and Document Databases/MongoDB
Tuesday 5/25   Machine Learning Meets Big Data
Thursday 5/27  Machine Learning Meets Big Data
Tuesday 6/1    Big-data Visualization
Thursday 6/3   Mid-term 3
Tuesday 6/10   Final Exam. 7:00 PM - 10:00 PM

Labs

# Topic Due Date Instructions
#1

Assignments

# Topic Due Date PDF
#1
#2
#3
#4
#5