Data Mining Techniques (CS-235)
Project

You must form groups of two students. If you cannot find a partner, please email Shiwen, who can find you a partner.

There are two types of projects: software projects and survey projects.

A project proposal must be submitted as a hard copy in class on 10/17/2012 and emailed the same day to Shiwen and Vagelis. The proposal should be 1-2 pages long, in PDF format, and contain the following:
  1. Names of group members
  2. Preferred date for project presentation (see class site for available dates)
  3. Project description

Your project proposal must be approved before you proceed with your project.

Your project grade depends not only on whether you addressed all items in your proposal, but also on the overall complexity and interest of your project.

The project deliverables should be submitted as a hard copy (except the source code for software projects) in class on 12/5/2012 and emailed (including source code) the same day to Shiwen and Vagelis.

Guidelines for Software projects

In the project proposal, you must state which dataset you plan to use, what problem you will solve, and how you will evaluate your solution.

A software project discovers or leverages interesting relationships within a significant amount of data. Ideally, the project leverages what we have learned in class.

A typical project involves:

1. Select one or more datasets, e.g., from http://archive.ics.uci.edu/ml/datasets.html, http://www.kaggle.com/, http://data.gov, tweets, or another source.

2. Define a problem on this data. For example, given a dataset of demographics, you may study which attributes (e.g., income, age, zip code, race) are correlated, whether an existing classifier performs well, whether the data needs special preprocessing, what the clusters produced by different clustering algorithms mean, whether there are interesting patterns, how to handle missing or dirty data, and so on.

3. Solve the problem. If the problem is sufficiently complex (e.g., it uses multiple datasets, requires tricky preprocessing, or involves crawling the web to get the data), then you may use data mining packages (e.g., WEKA). Otherwise, you should implement the data mining algorithms yourself, in any programming language. Make clear in your report what existing software you are using.

4. Evaluate your solution.
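
As a rough illustration of steps 2-4 (one possible shape, not a required approach), the sketch below fills in missing values, implements a small k-nearest-neighbor classifier by hand, and evaluates it with k-fold cross-validation. The function names (impute_missing, knn_predict, cross_val_accuracy) and the choice of mean imputation and k-NN are assumptions made for this example only; your own project may call for different preprocessing, algorithms, and evaluation measures.

```python
import math
import random


def impute_missing(rows):
    """Replace None in each numeric column with that column's mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def cross_val_accuracy(X, y, folds=5, k=3, seed=0):
    """Mean accuracy over `folds` disjoint held-out splits."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]          # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                      for i in test)
        scores.append(correct / len(test))
    return sum(scores) / folds


# Synthetic stand-in for a real dataset: two well-separated 2-D clusters.
rng = random.Random(42)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)] +
     [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(30)])
y = ["a"] * 30 + ["b"] * 30
X[0][1] = None                        # simulate one missing value
X = impute_missing(X)
print("5-fold accuracy:", cross_val_accuracy(X, y))
```

In a real project, the synthetic data would be replaced by the dataset named in your proposal, and accuracy alone may not be a sufficient evaluation measure (for instance, when classes are imbalanced).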

 

Project ideas (assuming you are able to find the right datasets):

The deliverables of a software project are:

1. A project report in pdf (file name should contain the last names of all group members), about 10 pages in any format you like, that includes most of the below, plus other material if needed:

2. A zip file with the source code.

 

Guidelines for Survey projects

In the project proposal, you must include the topic description and the list of papers you will survey.

First, you need to pick an interesting topic related to data mining in which there has been an adequate amount of research. Use Google Scholar to find the most important papers in this area (look for papers with many citations). Also consider commercial systems or products related to your topic.

Select about 5 papers for the survey. The paper selection must be part of the project proposal.

The deliverable is the survey paper in PDF (file name should contain the last names of all group members), which should be 12 pages long and formatted as described at http://www.acm.org/sigs/publications/proceedings-templates.

The survey must identify the common and differing characteristics across the papers and present them in a coherent, integrated way, not as just one paper per section.

Examples of survey topics are: