CS 167 - Introduction to Big-data

Time: Tuesday & Thursday - 2:00 PM to 3:20 PM

Location: Zoom. Check iLearn for Zoom link

Instructor: Ahmed Eldawy
Office Hours: Monday & Thursday 11:00 - 11:50 AM (Zoom link on iLearn)

TA: Payas Rajan
Office Hours: Monday & Wednesday 2:00 - 3:00 PM (Zoom link on iLearn)

TA (MSOL): Xin Zhang
Office Hours: Monday & Tuesday 6:00 - 7:30 PM (Zoom link on iLearn)

Syllabus

Textbook: Learning Spark: Lightning-Fast Data Analytics (2nd Edition) by Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee.

CS 167 covers the data management and systems aspects of big data platforms such as Hadoop, Spark, and AsterixDB. In this course, you will learn how data is stored in a distributed file system and how queries run in parallel. The course covers the following topics:

  • An overview of big data management systems
  • Distributed big-data storage
  • Programming models in big data (e.g., MapReduce and RDD)
  • Column-based storage and analytics on big data
  • Big spatial data
  • Document databases
  • Machine learning on big data
  • Big-data visualization

Grade Breakdown

  • (10%) Active class participation (Quizzes and activities)
  • (15%) Assignments
  • (30%) Labs
  • (15%) Mid-term 1
  • (15%) Mid-term 2
  • (15%) Mid-term 3

Grading Scheme

Grade  Points
A+     [97, 100]
A      [92, 97[
A-     [90, 92[
B+     [87, 90[
B      [83, 87[
B-     [80, 83[
C+     [77, 80[
C      [73, 77[
C-     [70, 73[
D+     [67, 70[
D      [63, 67[
D-     [50, 63[
F      [0, 50[
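
As a rough illustration of how the grade breakdown and the grading scheme fit together, here is a minimal Python sketch. It assumes that every component is scored on a 0-100 scale, that the final score is the plain weighted sum of the components, and that no rounding is applied before the cutoffs are checked; none of these details is stated on this page, and the names `WEIGHTS`, `CUTOFFS`, `total_points`, and `letter_grade` are hypothetical.

```python
# Weights from the "Grade Breakdown" list (percent of the final score).
WEIGHTS = {
    "participation": 10,
    "assignments": 15,
    "labs": 30,
    "midterm1": 15,
    "midterm2": 15,
    "midterm3": 15,
}

# Lower bounds from the "Grading Scheme" table, highest first.
CUTOFFS = [
    (97, "A+"), (92, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (50, "D-"),
    (0, "F"),
]


def total_points(scores):
    """Weighted sum of component scores (assumption: each is on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(points):
    """Return the first grade whose lower bound is <= points (assumption: no rounding)."""
    if not 0 <= points <= 100:
        raise ValueError("points must be in [0, 100]")
    for lower, grade in CUTOFFS:
        if points >= lower:
            return grade


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    example = {
        "participation": 100, "assignments": 90, "labs": 95,
        "midterm1": 80, "midterm2": 85, "midterm3": 90,
    }
    points = total_points(example)
    print(points, letter_grade(points))
```

Because the intervals are closed on the left and open on the right, a score of exactly 92 maps to A while 91.99 maps to A-. Checking lower bounds from the highest down reproduces that behavior without storing the upper bounds.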

Schedule

Date           Topic                                  Reading                                             Material
Tuesday 3/30   Introduction to Big Data                                                                   Slides
Thursday 4/1   A tour of big-data systems                                                                 Slides
Tuesday 4/6    Hadoop Distributed File System (HDFS)  HDFS Architecture                                   Slides
Thursday 4/8   Hadoop Distributed File System (HDFS)                                                      Class Activity
Tuesday 4/13   Big-data Processing                                                                        Slides
Thursday 4/15  MapReduce Computation
Tuesday 4/20   Resilient Distributed Datasets (RDD)
Thursday 4/22  Mid-term 1
Tuesday 4/27   Resilient Distributed Datasets (RDD)
Thursday 4/29  Spark SQL                                                                                  Slides
Tuesday 5/4    Machine Learning Meets Big Data        Intro to ML; Basic ML algorithms                    Slides
Thursday 5/6   Machine Learning Meets Big Data
Tuesday 5/11   Big Spatial Data                                                                           Slides
Thursday 5/13  Mid-term 2
Tuesday 5/18   Big Spatial Data
Thursday 5/20  Semi-structured data storage/Parquet   JSON introduction; Dremel Made Simple with Parquet  Slides
Tuesday 5/25   NoSQL and Document Databases/MongoDB
Thursday 5/27  NoSQL and Document Databases/MongoDB                                                       Slides
Tuesday 6/1    LSM Tree, Course Review & Next Steps                                                       Slides
Thursday 6/3   Mid-term 3
Tuesday 6/10   Final Exam. 7:00 PM - 10:00 PM

Labs

#   Topic              Due Date   Instructions
#1  Development Setup  4/5/2021   Instructions
#2  HDFS               4/12/2021  Instructions
#3  MapReduce          4/19/2021  Instructions
#4  Spark Java         4/26/2021  Instructions
#5  Spark Scala        5/3/2021   Instructions
#6  Spark SQL          5/10/2021  Instructions

Assignments

#   Topic          Due Date
#1  HDFS           4/20/2021
#2  MapReduce      5/4/2021 at 2:00 PM Pacific Time
#3  Spark RDD/SQL  Thursday, 5/13/2021 at 2:00 PM Pacific Time (before class)
#4
#5