CS 226 - Big-Data Management

Time: Monday, Wednesday & Friday - 10:00 AM to 10:50 AM

Zoom Link: Check on iLearn

Instructor: Ahmed Eldawy - Office Hours: Mon, Wed - 11:00-11:50 AM
Zoom Link: Check on iLearn

TA: Akil Sevim - Office Hours: TBA

Syllabus

CS 226 covers the data management and systems aspects of big data platforms such as Hadoop, Spark, and AsterixDB. In this course, you will learn how data is stored in a distributed file system and how queries run in parallel. The course covers the following topics:

  • An overview of big data management systems
  • Distributed storage of big data
  • Programming models in big data (e.g., MapReduce and RDD)
  • Packages for big data analysis (e.g., SparkSQL, MLlib, and SparkR)
  • An overview of NoSQL stores (e.g., document databases such as MongoDB)
  • Big SQL systems (e.g., AsterixDB, Impala, and SparkSQL)
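As a small taste of the MapReduce programming model listed above, the following is a minimal single-machine sketch of a word count in plain Python. It does not use the Hadoop or Spark APIs; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data management"]))
```

The user writes only the map and reduce functions; grouping by key is the part a framework such as Hadoop takes care of.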

Grade Breakdown

  • (5%) Active class participation
  • (10%) Reading and review assignments
  • (20%) Programming assignments
  • (15%) Mid-term
  • (50%) Project
    • Group selection - Due on Monday, 10/12/2020 (Week 2)
    • (5%) One-page proposal - Due on Monday, 10/26/2020 (Week 4)
    • (10%) Project proposal presentation - Due on Friday, 11/06/2020 (Week 5)
    • (10%) Literature survey - Due on Monday, 11/09/2020 (Week 6)
    • (5%) Report outline - Due on Monday, 11/16/2020 (Week 7)
    • Report draft (Optional! A chance to get early feedback) - Due on Wednesday, 11/25/2020 (Week 8)
    • (10%) Final report including the deliverables - Due on Friday, 12/11/2020 (Week 10)
    • (10%) Final presentation - TBD (Finals week)

Schedule

Date Topic Reading (Before class)  Material
Fri 10/02 Introduction to big data Slides
Mon 10/05 Introduction to big data (cont'd) The Age of Data Analytics
Wed 10/07 Introduction to Spark Slides
Fri 10/09 Introduction to SparkSQL and MLlib Slides
Mon 10/12 Overview of Big Spatial Data Group Selection Slides
Wed 10/14 Hadoop Distributed File System (HDFS) HDFS Architecture Slides
Fri 10/16 HDFS
Mon 10/19 HDFS
Wed 10/21 HDFS
Fri 10/23 Spark Resilient Distributed Datasets (RDD) Slides
Mon 10/26 Spark Resilient Distributed Datasets (RDD) One-page proposal due
Wed 10/28 Spark Resilient Distributed Datasets (RDD)
Fri 10/30
Mon 11/02 Spark RDD Operations Slides
Wed 11/04 Spark RDD Operations
Fri 11/06 Midterm exam (on iLearn)
Mon 11/09 Spark SQL Literature survey due Slides
Wed 11/11 Happy Veterans Day! No class!
Fri 11/13 Spark SQL
Mon 11/16 Storage and Indexing Report outline due Slides
Wed 11/18
Fri 11/20
Mon 11/23
Wed 11/25 NoSQL Report draft (Optional) Slides
Fri 11/27 Happy Thanksgiving holiday! No class!
Mon 11/30 Document Databases / MongoDB
Wed 12/02 AsterixDB - Big Data Management Systems Slides
Fri 12/04 Apache Spark MLlib Slides
Mon 12/07
Wed 12/09 Topics not Covered Slides
Fri 12/11 What's Next Final report due Slides

Assignments

# Topic Due Date PDF
#1

Project Ideas

  • Build a web interface that allows users to search big data (e.g., health records, census data, etc.)
  • Collect tweets and use them to run correlation analysis or sentiment analysis, e.g., how people in different states perceive brands (car brands, food brands, etc.)
  • Download satellite data and find the correlation between temperature, vegetation, precipitation, and fires. For example, you can compute the average per day/week/month/season/year and show how the averages change over time.
  • Collect census data, POI data, lakes, parks, etc. and try to rank cities in the US by their quality of life.
  • Build a 3D road network and visualize it on Google Earth. Collect the road network from OSM and adjust it with a Digital Elevation Model (DEM) to assign an altitude to it.
  • Extract a clean dataset from OpenStreetMap (OSM), i.e., a dataset that can be directly used in applications.
  • Build an efficient sampler for text files in Hadoop, one that is faster than the existing one. For example, it could iteratively read bigger samples until some statistical measure is met.
  • Build a map of images that shows an image for each region, where the regions change as we zoom in/out.
  • Build an interactive web visualizer for the Functional Map of the World (FMOW) dataset, which consists of satellite images and annotated regions (e.g., buildings or parks). The instructor has already obtained the full dataset and can share it with the project group members. Find more details at the following link: https://www.iarpa.gov/index.php/working-with-iarpa/prize-challenges/1015-functional-map-of-the-world-fmow
  • Build an interactive 3-D visualization for a given geospatial dataset using existing 3-D visualization packages such as Cesium, I3S, or three.js. Feel free to choose any other 3-D visualizer if you wish. The input dataset can be obtained from us or from an open data source such as https://www.data.gov/
  • In existing geospatial visualization systems such as Google Maps, Bing Maps, or HadoopViz, the spatial datasets are preprocessed into small tiles from which the maps are generated. All of these systems work with static datasets (datasets that are constant and not updated periodically). If new points are added to the input dataset, the tiles must be rebuilt from scratch before the points can be visualized. Build a similar system for a dynamic dataset that can show geospatial points on a map for a data stream such as Twitter data. The system should be able to incorporate new data points added to the input dataset without reconstructing all the tiles.
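For the last idea above, one possible starting point is to track which tiles a batch of new points touches, so that only those tiles are redrawn. The sketch below is a minimal in-memory illustration, not a prescribed design: the TileIndex class and its add_points method are hypothetical names, and the tile numbering assumes the common Web Mercator (z, x, y) scheme used by many web maps, which is only valid for latitudes within roughly ±85°.

```python
import math
from collections import defaultdict


def tile_of(lon, lat, zoom):
    """Return the (zoom, x, y) Web Mercator tile containing a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the outer edges stay inside the tile grid.
    return zoom, min(max(x, 0), n - 1), min(max(y, 0), n - 1)


class TileIndex:
    """Hypothetical in-memory index that maps each tile to its points."""

    def __init__(self, max_zoom):
        self.max_zoom = max_zoom
        self.tiles = defaultdict(list)

    def add_points(self, points):
        """Insert new (lon, lat) points and return the tiles to redraw."""
        dirty = set()
        for lon, lat in points:
            for zoom in range(self.max_zoom + 1):
                key = tile_of(lon, lat, zoom)
                self.tiles[key].append((lon, lat))
                dirty.add(key)
        return dirty


if __name__ == "__main__":
    index = TileIndex(max_zoom=2)
    print(index.add_points([(0.0, 0.0)]))
```

Each new point affects one tile per zoom level, so a small batch leaves most tiles untouched. A real system would also need persistent storage and a renderer for the returned tiles.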

Reviews

Please read the following articles and provide a brief summary as directed below.

Due Date Article
Monday, October 5th The Age of Analytics (Executive Summary)
Provide a review with the following parts in no more than 500 words in total.
  • A 150-word summary of the article.
  • The main contributions of the article.
  • Five points in the article that you either liked or disliked.
Wednesday, October 14th The architecture of HDFS

Further readings

  1. Wentian Guo, Yuchen Li, Mo Sha, Kian-Lee Tan: Parallel Personalized Pagerank on Dynamic Graphs. PVLDB: 93-106 (2017)
  2. Mo Sha, Yuchen Li, Bingsheng He, Kian-Lee Tan: Accelerating Dynamic Graph Analytics on GPUs. PVLDB: 107 - 120 (2017)
  3. Fabian Fier, Nikolaus Augsten, Panagiotis Bouros, Ulf Leser, Johann-Christoph Freytag: Set Similarity Joins on MapReduce: An Experimental Survey. PVLDB: 1110 - 1122 (2018).
  4. Ramon Antonio Rodriges Zalipynis: ChronosDB: Distributed, File Based, Geospatial Array DBMS. PVLDB: 1247 - 1261 (2018)
  5. Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiane-Ruiz, Nan Tang, Saravanan Thirumuruganathan, Anis Troudi: RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! - PVLDB 1414 - 1427 (2018).
  6. Zainab Abbas, Vasiliki Kalavri, Paris Carbone, Vladimir Vlassov: Streaming Graph Partitioning: An Experimental Study. PVLDB 1590 - 1603 (2018)
  7. Qingchao Cai, Wentian Guo, Hao Zhang, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Yong Meng Teo, Sheng Wang: Efficient Distributed Memory Management with RDMA and Caching. PVLDB 1604 - 1617 (2018). Assigned to Yanting Liu
  8. Søren Jensen, Torben Pedersen, Christian Thomsen: ModelarDB: Modular Model-Based Time Series Management with Spark and Cassandra. PVLDB 1688 - 1701 (2018)
  9. Matthias Boehm, Berthold Reinwald, Dylan Hutchison, Prithviraj Sen, Alexandre Evfimievski, Niketan Pansare: On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML. PVLDB 1755 - 1768 (2018)
  10. Gabriela Jacques-Silva, Ran Lei, Luwei Cheng, Guoqiang Jerry Chen, Kuen Ching, Tanji Hu, Yuan Mei, Kevin Wilfong, Rithin Shetty, Serhat Yilmaz, Anirban Banerjee, Benjamin Heintz, Shridar Iyer, Anshul Jaiswal: Providing Streaming Joins as a Service at Facebook. PVLDB 1809 - 1821 (2018)
  11. Bart Samwel, John Cieslewicz, Ben Handy, Jason Govig, Petros Venetis, Chanjun Yang, Keith Peters, Jeff Shute, Daniel Tenedorio, Himani Apte, Felix Weigel, David Wilhite, Jiacheng Yang, Jun Xu, Jiexing Li, Zhan Yuan, Craig Chasseur, Qiang Zeng, Ian Rae, Anurag Biyani, Andrew Harn, Yang Xia, Andrey Gubichev, Amr ElHelw, Orri Erling, Zhepeng Yan, Mohan Yang, Yiqun Wei, Thanh Do, Colin Zheng, Goetz Graefe, Somayeh Sardashti, Ahmed M. Aly, Divy Agrawal, Ashish Gupta, Shiv Venkataraman: F1 Query: Declarative Querying at Scale. PVLDB 1835 - 1848 (2018)
  12. Disheng Qiu, Luciano Barbosa, Valter Crescenzi, Paolo Merialdo, Divesh Srivastava. Big Data Linkage for Product Specification Pages SIGMOD Pages: 67-81 (2018)
  13. Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources SIGMOD Pages: 221-230 (2018)
  14. Badrish Chandramouli, Guna Prasaad, Donald Kossmann, Justin Levandoski, James Hunter, Mike Barnett. FASTER: A Concurrent Key-Value Store with In-Place Updates. SIGMOD, Pages: 275-290 (2018) Assigned to Aakanksha Patel
  15. Seongyun Ko, Wook-Shin Han. TurboGraph++: A Scalable and Fast Graph Analytics System. SIGMOD Pages: 395-410 (2018)
  16. Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis. The Case for Learned Index Structures. SIGMOD Pages: 489-504 (2018)
  17. Mohiuddin Abdul Qader, Shiwen Cheng, Vagelis Hristidis. A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases. SIGMOD Pages: 551-566 (2018)
  18. Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, Matei Zaharia. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark SIGMOD Pages: 601-613 (2018)
  19. Zeyuan Shang, Guoliang Li, Zhifeng Bao. DITA: Distributed In-Memory Trajectory Analytics. SIGMOD Pages: 725-740 (2018)
  20. Ben Vandiver, Shreya Prasad, Pratibha Rana, Eden Zik, Amin Saeidi, Pratyush Parimal, Styliani Pantela, Jaimin Dave. Eon Mode: Bringing the Vertica Columnar Database to the Cloud. SIGMOD Pages: 797-809 (2018)
  21. Ildar Absalyamov, Michael J. Carey, Vassilis J. Tsotras. Lightweight Cardinality Estimation in LSM-based Systems SIGMOD Pages: 841-855 (2018)
  22. Hwanjun Song, Jae-Gil Lee. RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning. SIGMOD Pages: 1173-1187 (2018)
  23. Maaz Bin Safeer Ahmad, Alvin Cheung. Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications. SIGMOD Pages: 1205-1220 (2018)
  24. Ryan Marcus, Olga Papaemmanouil, Sofiya Semenova, Solomon Garber. NashDB: An End-to-End Economic Method for Elastic Database Fragmentation, Replication, and Provisioning. SIGMOD Pages: 1253-1267 (2018)
  25. Ge, Hancheng, Kai Zhang, Majid Alfifi, Xia Hu, and James Caverlee. "DisTenC: A Distributed Algorithm for Scalable Tensor Completion on Spark." ICDE 2018
  26. Vaibhav Arora, Faisal Nawab, Divyakant Agrawal, Amr El Abbadi: Janus: A Hybrid Scalable Multi-Representation Cloud Datastore. IEEE Trans. Knowl. Data Eng. 30(4): 689-702 (2018) Assigned to Shanshan Liu
  27. Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai, Shuo Wu, Sandeep Govind Dhoot, Abhilash Rajesh Kumar, Ankur Agiwal, Sanjay Bhansali, Mingsheng Hong, Jamie Cameron, Masood Siddiqi, David Jones, Jeff Shute, Andrey Gubarev, Shivakumar Venkataraman, Divyakant Agrawal: Mesa: a geo-replicated online data warehouse for Google's advertising system. Commun. ACM 59(7): 117-125 (2016)
  28. Miguel Liroz-Gistau, Reza Akbarinia, Divyakant Agrawal, Patrick Valduriez: FP-Hadoop: Efficient processing of skewed MapReduce jobs. Inf. Syst. 60: 69-84 (2016)
  29. Zhengkui Wang, Yan Chu, Kian-Lee Tan, Divyakant Agrawal, Amr El Abbadi: HaCube: Extending MapReduce for Efficient OLAP Cube Materialization and View Maintenance. DASFAA (2) 2016: 113-129
  30. Divy Agrawal, Sanjay Chawla, Ahmed K. Elmagarmid, Zoi Kaoudi, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki: Road to Freedom in Big Data Analytics. EDBT 2016: 479-484
  31. Sarath Lakshman, Sriram Melkote, John Liang, Ravi Mayuram: Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index. PVLDB 9(13): 1413-1424 (2016)
  32. Peng Lu, Gang Chen, Beng Chin Ooi, Hoang Tam Vo, Sai Wu: ScalaGiST: Scalable Generalized Search Trees for MapReduce Systems. PVLDB 7(14): 1797-1808 (2014)
  33. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica: Discretized streams: fault-tolerant streaming computation at scale. SOSP 2013: 423-438
  34. Jorge-Arnulfo Quiané-Ruiz, Christoph Pinkel, Jörg Schad, Jens Dittrich: RAFTing MapReduce: Fast recovery on the RAFT. ICDE 2011: 589-600
  35. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012: 15-28
  36. Botong Huang, Nicholas W. D. Jarrett, Shivnath Babu, Sayan Mukherjee, Jun Yang: Cumulon: Matrix-Based Data Analytics in the Cloud with Spot Instances. PVLDB 9(3): 156-167 (2015) Assigned to Lily Li
  37. Jack Chen, Samir Jindel, Robert Walzer, Rajkumar Sen, Nika Jimsheleishvilli, Michael Andrews: The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database. PVLDB 9(13): 1401-1412 (2016)
  38. Konstantinos Kloudas, Rodrigo Rodrigues, Nuno M. Preguiça, Margarida Mamede: PIXIDA: Optimizing Data Parallel Jobs in Wide-Area Data Analytics. PVLDB 9(2): 72-83 (2015)
  39. Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd D. Millstein, Tyson Condie: Titian: Data Provenance Support in Spark. PVLDB 9(3): 216-227 (2015)
  40. Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, Stanley B. Zdonik: The BigDAWG Polystore System. SIGMOD Record 44(2): 11-16 (2015)
  41. Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015)
  42. Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis J. Tsotras, Rares Vernica, Jian Wen, Till Westmann: AsterixDB: A Scalable, Open Source BDMS. PVLDB 7(14): 1905-1916 (2014)
  43. Sattam Alsubaiee, Alexander Behm, Vinayak R. Borkar, Zachary Heilbron, Young-Seok Kim, Michael J. Carey, Markus Dreseler, Chen Li: Storage Management in AsterixDB. PVLDB 7(10): 841-852 (2014)
  44. E. Preston Carman Jr., Till Westmann, Vinayak R. Borkar, Michael J. Carey, Vassilis J. Tsotras: A scalable parallel XQuery processor. Big Data 2015: 164-173
  45. Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica: GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014: 599-613
  46. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski: Pregel: a system for large-scale graph processing. SIGMOD Conference 2010: 135-146
  47. Taewoo Kim, Wenhai Li, Alexander Behm, Inci Cetindil, Rares Vernica, Vinayak R. Borkar, Michael J. Carey, Chen Li: Supporting Similarity Queries in Apache AsterixDB. EDBT 2018: 528-539
  48. Christina Pavlopoulou, E. Preston Carman Jr., Till Westmann, Michael J. Carey, Vassilis J. Tsotras: A Parallel and Scalable Processor for JSON Data. EDBT 2018: 576-587
  49. Michael J. Carey, Steven Jacobs, Vassilis J. Tsotras: Breaking BAD: a data serving vision for big active data. DEBS 2016: 181-186
  50. Lauro Lins, James T. Klosowski, and Carlos Scheidegger. Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. Visualization and Computer Graphics, IEEE Transactions on 19, no. 12 (2013): 2456-2465.
  51. Zhicheng Liu, Biye Jiang, and Jeffrey Heer. imMens: Real-time Visual Querying of Big Data. Eurographics Conference on Visualization (EuroVis) 2013, Volume 32, Number 3