CS 226 - Big-Data Management

Time: Tuesday & Thursday - 9:40 AM to 11:00 AM

Location: WCH 142

Instructor: Ahmed Eldawy - Office Hours: 357 WCH - Tuesday & Thursday 11:00 AM - 12:00 PM

Syllabus

CS 226 covers the data management and systems aspects of big data platforms such as Hadoop, Spark, and AsterixDB. In this course, you will learn how data is stored in a distributed file system and how queries run in parallel. The course covers the following topics:

  • An overview of big data management systems
  • Distributed storage of big data
  • Programming models in big data (e.g., MapReduce and RDD)
  • Packages for big data analysis (e.g., SparkSQL, MLlib, and SparkR)
  • An overview of key-value stores
  • Big SQL systems (e.g., AsterixDB, Impala, and SparkSQL)

Grade Breakdown

  • (10%) Class participation
  • (15%) Individual presentation
  • (5%) Reading assignments
  • (15%) Programming assignments
  • (55%) Project
    • (5%) Proposal
    • (10%) Literature survey
    • (5%) Report outline
    • (20%) Final report including the deliverables
    • (15%) Final presentation and overall work

Schedule

Date Topic Reading (Before class)  Material
Tue 01/09 Introduction to big data  Slides
Thu 01/11 Hadoop Overview 1. McKinsey big-data report (Review due on 1/16)
2. Hadoop single-node setup (Before class)
3. Maven setup instructions (Before class)
 Slides
 Hadoop Instructions
Tue 01/16 HDFS K. Shvachko et al. The Hadoop Distributed File System, MSST 2010. doi: 10.1109/MSST.2010.5496972
HDFS Architecture
 Slides
 Work sheet
 Work sheet answers
Thu 01/18 MapReduce Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150  Slides
Tue 01/23 MapReduce cont'd  Slides
Thu 01/25 MapReduce examples Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: "Pig latin: a not-so-foreign language for data processing." SIGMOD Conference 2008: 1099-1110  In-class questions
 Source code
 Sample log file #1
 Sample log file #2
 Sample log file #3
 Pig instructions
 Pig examples
Tue 01/30 Resilient Distributed Datasets (RDD) Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, Ion Stoica: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012: 15-28
Thu 02/01 Spark RDD Examples RDD programming guide In-class questions
In-class code
Spark instructions
Tue 02/06 Spark RDD Examples (cont'd) Spark instructions
Thu 02/08 Spark SQL M. Armbrust et al. Spark SQL: Relational Data Processing in Spark. SIGMOD 2015 DOI: https://doi.org/10.1145/2723372.2742797
Tue 02/13
Thu 02/15
Tue 02/20 Jack Chen et al: The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database. PVLDB 9(13): 1401-1412 (2016) Presented by Mugdha Kedar Patil
Ahmed Eldawy et al: SHAHED: A MapReduce-based system for querying and visualizing spatio-temporal satellite data. ICDE 2015: 1585-1596 Presented by Samriddhi Singla
Kathy Lee et al. 2011. Twitter Trending Topic Classification. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW '11). IEEE Computer Society, Washington, DC, USA, 251-258. DOI=http://dx.doi.org/10.1109/ICDMW.2011.171 Presented by Gautham Mani
Matthias Böhm et al: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 37(3): 52-62 (2014) Presented by Al Amin Hossain
Thu 02/22 Manoochehr Ghiassi et al. "Targeted twitter sentiment analysis for brands using supervised feature engineering and the dynamic architecture for artificial neural networks." Journal of Management Information Systems 33.4 (2016): 1034-1058. Presented by Yi Zhu
Apoorv Agarwal et al. "Sentiment analysis of twitter data." Proceedings of the workshop on languages in social media. Association for Computational Linguistics, 2011. Presented by Jakapun Tachaiya
Doug Beaver et al. "Finding a Needle in Haystack: Facebook's Photo Storage." OSDI 2010. Presented by Harish Gonnabattula
C. Sabottke et al. "Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits." In USENIX Security Symposium (pp. 1041-1056). Presented by Ali Davanian
Hypertable Architecture Overview Presented by Abhignana Kandepu
Tue 02/27 Shivaram Venkataraman et al. "SparkR: Scaling R Programs with Spark." SIGMOD 2016. 1099-1104. DOI: https://doi.org/10.1145/2882903.2903740 Presented by Abhay Singh
Mehran Bozorgi et al. "Beyond heuristics: learning to classify vulnerabilities and predict exploits." SIGKDD 2010, pp. 105-114. Presented by Yuanlai Liu
Joseph E. Gonzalez et al. "GraphX: Graph Processing in a Distributed Dataflow Framework" OSDI 2014: 599-613 Presented by Siddharth Arun
Jorge-Arnulfo Quiané-Ruiz et al. RAFTing MapReduce: Fast recovery on the RAFT. ICDE 2011: 589-600 Presented by Gisel Bastidas Guacho
Thu 03/01 Reynold S. Xin et al, Shark: SQL and Rich Analytics at Scale Presented by Vishnu Chandrasekar
Mihai Christodorescu et al. "Mining specifications of malicious behavior." In Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering (ESEC-FSE '07). ACM, New York, NY, USA, 5-14. DOI=http://dx.doi.org/10.1145/1287624.1287628 Presented by Tsung-Ying Chen
Joseph E. Gonzalez et al, PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, OSDI 2012 Presented by Vishalsurya Madhavan
Presented by Kaicheng Shou
Xiangrui Meng et al. "MLlib: Machine Learning in Apache Spark." Journal of Machine Learning Research 17: 34:1-34:7 (2016) Presented by Krupa Hegde
Tue 03/06 Rares Vernica et al, "Efficient parallel set-similarity joins using MapReduce," SIGMOD 2010. Presented by Shitong Zhu
Tim Kraska et al, "MLbase: A Distributed Machine-learning System." CIDR 2013. Presented by Kevin Tang
Wei Yin et al, "Scalable regression tree learning on hadoop using openplanet." In Proceedings of third international workshop on MapReduce and its Applications Date, pp. 57-64. ACM, 2012. Presented by Achyuth Diwakar
Michael E. Payne et al, Managing the Academic Data Lifecycle: A Case Study of HPCC, IEEE International Conference on Big Data, 2014 Presented by Shweti Mahajan
B. M. Wilamowski et al, Big data and deep learning, 2016 IEEE 20th Jubilee International Conference on Intelligent Engineering Systems (INES), Budapest, 2016, pp. 11-16, DOI: 10.1109/INES.2016.7555103 Presented by Rutuja Gurav
Thu 03/08 Gupta, Bhumika, et al. "Study of Twitter Sentiment Analysis using Machine Learning Algorithms on Python." International Journal of Computer Applications 165.9 (2017). Presented by Fei Yi
B. Bahmani et al, "Scalable k-means++." VLDB 2012. doi: 10.14778/2180912.2180915 Presented by A. B. Siddique
Seyedmehdi Hosseinimotlagh and Evangelos E. Papalexakis. "Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles." WSDM 2018 MIS2: Misinformation and Misbehavior Mining on the Web Workshop (2018). Presented by Nan Zhang
Abdulmalik Alhathlool
Avery Ching, et al. "One trillion edges: Graph processing at facebook-scale." VLDB (2015): 1804-1815. Presented by Amruta Sawant
Tue 03/13
Thu 03/15
Thu 03/22 8:00 AM - 11:00 AM Final Exam

Project Ideas

  • Build a web interface that allows users to search big data (e.g., health records or census data).
  • Collect tweets and use them to run correlation or sentiment analysis, e.g., how people in different states perceive brands (car brands, food brands, etc.).
  • Download satellite data and find the correlation between temperature, vegetation, precipitation, and fires. For example, you can compute the average per day/week/month/season/year and show how the averages change over time.
  • Collect census data, POI data, lakes, parks, etc., and try to rank cities in the US by their quality of life.
  • Build a 3D road network and visualize it on Google Earth. Collect the road network from OSM and adjust it with a Digital Elevation Model (DEM) to assign an altitude to it.
  • Extract a clean dataset from OpenStreetMap (OSM), i.e., a dataset that can be directly used in applications.
  • Build an efficient sampler for text files in Hadoop, one that is faster than the existing one. For example, it could iteratively read bigger samples until some statistical measure is met.
  • Build a map of images that shows an image for each region, where the regions change as we zoom in/out.
  • Build an interactive web visualizer that visualizes the functional map of the world (FMOW), which consists of satellite images and annotated regions (e.g., buildings or parks). The instructor has already obtained the full dataset and can share it with the project group members. Find more details at the following link: https://www.iarpa.gov/index.php/working-with-iarpa/prize-challenges/1015-functional-map-of-the-world-fmow
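
As a rough illustration of the sampler idea above (iteratively reading bigger samples until some statistical measure is met), the following Python sketch doubles the sample size until the relative standard error of a per-line measure falls below a threshold. Everything in it is an assumption made for illustration: the function name `iterative_sample`, the choice of line length as the measure, the doubling schedule, and the 5% threshold. It works on an in-memory list of lines, not on HDFS.

```python
import math
import random
import statistics


def iterative_sample(lines, measure=len, initial_size=100,
                     max_rel_error=0.05, seed=0):
    """Grow a random sample until the relative standard error of
    measure(line) is at most max_rel_error, or the input is exhausted.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    if not lines:
        return []
    rng = random.Random(seed)
    # Shuffle the indices once so each larger sample extends the previous one.
    order = list(range(len(lines)))
    rng.shuffle(order)
    size = min(initial_size, len(lines))
    while True:
        sample = [lines[i] for i in order[:size]]
        values = [measure(line) for line in sample]
        mean = statistics.fmean(values)
        if len(values) > 1 and mean != 0:
            std_err = statistics.stdev(values) / math.sqrt(len(values))
            if std_err / abs(mean) <= max_rel_error:
                return sample
        if size == len(lines):
            return sample  # the whole input was read; nothing more to sample
        size = min(size * 2, len(lines))
```

A project implementation would likely read random blocks or splits of the file instead of shuffling all line indices in memory, and could use a different stopping measure; this sketch only shows the shape of the stopping loop.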

Reading Material

Date Material
1/9/2018 Nicolaus Henke, Jacques Bughin, Michael Chui, James Manyika, Tamim Saleh, Bill Wiseman, Guru Sethupathy, “The Age of Analytics: Competing in a Data-Driven World (Executive Summary).” McKinsey & Company, December 2016.
1/12/2018 K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, 2010, pp. 1-10. doi: 10.1109/MSST.2010.5496972
1/25/2018 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: "Pig latin: a not-so-foreign language for data processing." SIGMOD Conference 2008: 1099-1110
1/30/2018 Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, Ion Stoica: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012: 15-28
2/08/2018 Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383-1394. DOI: https://doi.org/10.1145/2723372.2742797
2/15/2018 Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis J. Tsotras, Rares Vernica, Jian Wen, Till Westmann: AsterixDB: A Scalable, Open Source BDMS. PVLDB 7(14): 1905-1916 (2014)

Further readings

  • Sarath Lakshman, Sriram Melkote, John Liang, Ravi Mayuram: Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index. PVLDB 9(13): 1413-1424 (2016)
  • Peng Lu, Gang Chen, Beng Chin Ooi, Hoang Tam Vo, Sai Wu: ScalaGiST: Scalable Generalized Search Trees for MapReduce Systems. PVLDB 7(14): 1797-1808 (2014)
  • Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica: Discretized streams: fault-tolerant streaming computation at scale. SOSP 2013: 423-438
  • Jorge-Arnulfo Quiané-Ruiz, Christoph Pinkel, Jörg Schad, Jens Dittrich: RAFTing MapReduce: Fast recovery on the RAFT. ICDE 2011: 589-600
  • Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, Ion Stoica: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012: 15-28
  • Botong Huang, Nicholas W. D. Jarrett, Shivnath Babu, Sayan Mukherjee, Jun Yang: Cumulon: Matrix-Based Data Analytics in the Cloud with Spot Instances. PVLDB 9(3): 156-167 (2015)
  • Jack Chen, Samir Jindel, Robert Walzer, Rajkumar Sen, Nika Jimsheleishvilli, Michael Andrews: The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database. PVLDB 9(13): 1401-1412 (2016)
  • Konstantinos Kloudas, Rodrigo Rodrigues, Nuno M. Preguiça, Margarida Mamede: PIXIDA: Optimizing Data Parallel Jobs in Wide-Area Data Analytics. PVLDB 9(2): 72-83 (2015)
  • Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd D. Millstein, Tyson Condie: Titian: Data Provenance Support in Spark. PVLDB 9(3): 216-227 (2015)
  • Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, Stanley B. Zdonik: The BigDAWG Polystore System. SIGMOD Record 44(2): 11-16 (2015)
  • Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015)
  • Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis J. Tsotras, Rares Vernica, Jian Wen, Till Westmann: AsterixDB: A Scalable, Open Source BDMS. PVLDB 7(14): 1905-1916 (2014)
  • Sattam Alsubaiee, Alexander Behm, Vinayak R. Borkar, Zachary Heilbron, Young-Seok Kim, Michael J. Carey, Markus Dreseler, Chen Li: Storage Management in AsterixDB. PVLDB 7(10): 841-852 (2014)
  • E. Preston Carman Jr., Till Westmann, Vinayak R. Borkar, Michael J. Carey, Vassilis J. Tsotras: A scalable parallel XQuery processor. Big Data 2015: 164-173
  • Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica: GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014: 599-613
  • Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski: Pregel: a system for large-scale graph processing. SIGMOD Conference 2010: 135-146