The common theme of all projects is Big Data analysis using Spark.

Each project will have 3 parts:

1. Collect data. You may pick any dataset (or combination of datasets) you like; the only constraint is that it must be at least 5 GB. Examples are:

a. Crawl the Web to get Web pages (a minimal crawler sketch appears after this list).

b. Social network data, e.g., via the Twitter streaming API, Google+ API, or Instagram API (check whether you can actually collect more than 5 GB).

c. Crawl the Web to get images.

d. Get a public dataset, e.g., from http://archive.ics.uci.edu/ml/datasets.html, tweets, http://www.kaggle.com/, http://data.gov, or another source. If you use such an existing dataset, the next two parts of the project should be more sophisticated to compensate for this convenience.

e. This year I am particularly interested in healthcare and real estate datasets. If you are interested, email me and I can provide datasets of all doctors in the US with their location coordinates, all hospitals in the US with coordinates, or homes for sale in California.
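For the crawling options above, here is a minimal sketch of a breadth-first crawler in Python, using the requests and BeautifulSoup libraries. The seed URLs, page limit, and crawl delay are placeholders you would tune, and a real crawler should also honor robots.txt and persist pages to disk or HDFS rather than keeping them in memory:

# Minimal breadth-first crawler sketch. A real crawler would write pages
# to disk/HDFS and respect robots.txt; this only shows the core loop.
import time
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=1000, delay=1.0):
    queue = deque(seeds)   # frontier of URLs to visit
    seen = set(seeds)      # avoid re-fetching the same URL
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue       # skip unreachable pages
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue       # keep only HTML pages
        pages.append((url, resp.text))
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))  # absolutize, drop #fragment
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)  # be polite to servers
    return pages

# Hypothetical usage:
# pages = crawl(["http://example.com/"], max_pages=100)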

2. Preprocess or analyze your data in a distributed (parallel) way using Spark, and store the output in the key-value store Cassandra. E.g.:

a. Build a text index (see http://en.wikipedia.org/wiki/Inverted_index) for the Web pages.

b. Locate shapes in images, or do any other analysis on images. You may use existing source code for image analysis and adapt it to run on Spark.

c. If your data is tabular, compute average income by zipcode, or other group-by queries (tabular datasets are unlikely to exceed 5 GB, so you may want to combine them with other data).

d. Find the most popular hashtags on Twitter for each day (see the Spark sketch after this list), or build a spatial index that, for each city, holds a list of tweets.

e. Following up on the healthcare and real estate datasets mentioned above, you could: find zipcodes with too few or too many doctors (or even go down to the specialty level, e.g., too few dermatologists), hospitals, or homes for sale per capita (you will need population data per zipcode from the US Census website); find zipcodes with a higher price per house or per bedroom; or combine with other public datasets (e.g., average income per zipcode) to get more interesting results.
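As a concrete illustration of task (d), here is a minimal PySpark sketch that counts hashtags per day and writes the result to Cassandra. It assumes the DataStax Spark Cassandra Connector is on the classpath, that the tweets are stored as JSON lines with a created_at field already normalized to a parseable timestamp, and that the HDFS path, keyspace, and table name are placeholders:

# Sketch: most popular hashtags per day, stored in Cassandra.
# Assumes the DataStax connector is on the classpath, e.g.:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 job.py
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("daily-hashtags")
    .config("spark.cassandra.connection.host", "127.0.0.1")  # your Cassandra host
    .getOrCreate()
)

# Hypothetical input: one JSON tweet per line with "created_at" (normalized
# to an ISO timestamp) and "text" fields.
tweets = spark.read.json("hdfs:///data/tweets/")

counts = (
    tweets
    .withColumn("day", F.to_date("created_at"))
    .withColumn("tag", F.explode(F.split("text", r"\s+")))  # tokenize on whitespace
    .filter(F.col("tag").startswith("#"))                    # keep hashtags only
    .groupBy("day", "tag")
    .count()
    .withColumnRenamed("count", "cnt")
)

# Hypothetical target table, created beforehand in cqlsh:
#   CREATE TABLE project.hashtag_counts (
#       day date, tag text, cnt bigint, PRIMARY KEY (day, tag));
(
    counts.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="project", table="hashtag_counts")
    .mode("append")
    .save()
)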

3. Build a Web interface (use your favorite web programming framework) to explore the preprocessed or analyzed data. E.g. (corresponding to the preprocessing tasks above; a minimal back-end sketch appears after this list):

a. Allow searching pages by keyword. You could use Lucene as the back-end, or build a simpler string-matching algorithm from scratch.

b. Search images by shape.

c. Do a simple OLAP-style exploration of the data; view average incomes on a map.

d. View the most popular hashtags for each city on a map.

e. Display heatmaps of #doctors-per-zipcode, #doctors-per-capita, #hospitals-per-capita, avg-home-price, and avg-home-price-over-avg-income. By heatmap, we mean that you can zoom in on the map and see a more fine-grained heatmap.
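To give a sense of the glue between Parts 2 and 3, below is a minimal Flask endpoint that reads the hypothetical hashtag_counts table from the Part 2 sketch above and returns one day's counts as JSON; any web framework and front-end map library would work equally well:

# Minimal Flask back-end sketch serving precomputed results from Cassandra.
# Assumes the hypothetical project.hashtag_counts table from the Part 2 sketch.
import datetime

from cassandra.cluster import Cluster   # pip install cassandra-driver
from flask import Flask, jsonify

app = Flask(__name__)
session = Cluster(["127.0.0.1"]).connect("project")  # host/keyspace are placeholders

@app.route("/hashtags/<day>")
def hashtags(day):
    # day arrives as YYYY-MM-DD, e.g. GET /hashtags/2019-03-01
    rows = session.execute(
        "SELECT tag, cnt FROM hashtag_counts WHERE day = %s",
        (datetime.date.fromisoformat(day),),
    )
    return jsonify({row.tag: row.cnt for row in rows})

if __name__ == "__main__":
    app.run(debug=True)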

Deliverables (all in PDF):

Deadline: 1/18 (dates may be slightly updated)
Deliverable: Form groups and project proposal.
Submission: Email to the instructor (cc the TA).
Description: Each group has three members. If you cannot find partners, email the TA ASAP to be matched with other students. The proposal consists of 1-2 pages describing in detail what you will do for each of the three parts of the project.
Points: 5 (based on how interesting and novel the proposal is)

Deadline: 2/8
Deliverable: Part 1: Data Collection.
Submission: Hardcopy in lab and submission in iLearn; also submit a zip file with your code (if any) in iLearn.
Description: Report containing Requirements, Design, Implementation, Evaluation (statistics on the amount and properties of the data, and the time it took to collect it), Screenshots, and the contribution of each team member. Collect your data, clean it, and store it on the lab servers; show your data to the TA in the lab.
Points: 10

Deadline: 3/1
Deliverable: Part 2: Spark data processing, with output stored in Cassandra.
Submission: Hardcopy in lab and submission in iLearn; also submit a zip file with your code in iLearn.
Description: Requirements, Design, Implementation, Evaluation (execution-time graphs with a varying number of Spark nodes and varying data sizes; compare against a single-node implementation), Screenshots, and the contribution of each team member. Discuss your evaluation graphs. Demo to the TA in the lab.
Points: 30

Deadline: 3/15
Deliverable: Demo of Part 3.
Description: Demo to the TA in the lab.

Deadline: 3/22
Deliverable: Final report.
Submission: Drop off a hardcopy at the TA's office hours and submit in iLearn; also submit a zip file with your code in iLearn.
Description: For Part 3: Requirements, Design, Implementation, Evaluation, Screenshots, and the contribution of each team member. For the final report: include Parts 1, 2, and 3, and address the comments you received on your earlier submissions of Parts 1 and 2.
Points: 55 (25 for Part 3, 15 for Part 1 revisions, 15 for Part 2 revisions; if no revisions were requested, you still get the 15 points)

Notes:

You will be graded on factors including the interestingness and novelty of your project, its technical challenges and robustness, the cleanness of your code and documentation, and your presentation.

Additional data sources:

1. ICWSM 2009 Spinn3r Blog Dataset
http://icwsm.cs.umbc.edu/


2. Million Song Dataset
http://labrosa.ee.columbia.edu/millionsong/

3. Usenet corpus dataset
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

4. Google books n-grams:
https://aws.amazon.com/datasets/8172056142375670

5. Click datasets:
http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset

6. Wikipedia dump:
http://www.mappian.com/blog/hadoop/using-hadoop-to-analyze-the-full-wikipedia-dump-files-using-wikihadoop/