Professor: Dimitrios Gunopulos (dg@cs.ucr.edu)
TA: Benjamin Arai (barai@cs.ucr.edu)
Class Time/Location: Thursdays, 5:10pm-6pm STAT 2674
Professor Office Hours/Location: By Appointment, or TTh 2pm-3:30pm, EBU II, Room 324
TA Office Hours/Location: By Appointment, EBU II, Room 363
Mailing List: cs179g-dg@lists.cs.ucr.edu
The goal of this class is to give each student a basic understanding of designing and implementing a large database project in a group environment. The project for this quarter is divided into several phases, and groups will be evaluated and graded at each phase. There will also be a final project submission and demonstration during the final class of the quarter (December 1, 2005). The initial phase involves deciding upon a project that meets the requirements of the class and submitting a “Requirements” document by October 6, 2005. Once the design document has been approved, each group will implement a spatial analysis tool focusing on spatial indexing and cluster analysis. Each group will implement basic spatial indexing structures using a provided API, and then build its own spatial analysis tool on top of that API.
· October 6, 2005: IEEE design document must be emailed to the Professor or TA and approved before this date.
· October 13, October 27, and November 11, 2005: Progress report for each group.
· November 25, 2005: Completed binders are due for each group.
· December 1 and December 8, 2005: Project presentations.
A requirements document is a common tool in software development designed to create a written contract between the developer(s) and client(s). It sets boundaries that protect both parties in the event of miscommunication, and it gives a clear idea of what is required and what is expected from each group. For the purpose of our project, we will adhere to an IEEE design format. A specific document description can be viewed at:
http://www.cmcrossroads.com/bradapp/docs/sdd.html
For the project, this document will serve as a contract between the developers (you) and the client (us). Be careful in designing this document: you will be held to all of the specifications you include. On the other hand, if the requirements are not specific or the document does not meet the minimum standards of the class, you will be asked to rework and resubmit the document.
You will have to describe an application where spatial indexing is applicable. For example, you may want to build an application that can identify which bookstore is closest to your current location. You will also have to find the appropriate datasets (with respect to the application you have chosen) that will help you evaluate the tools that you will build for your project.
The project consists of two distinct portions. The first portion of the project is to implement a spatial indexing tool using one of the following API toolsets:
· C++ Spatial Indexing API, by Marios Hadjieleftheriou: http://www.cs.ucr.edu/~marioh/
· PostgreSQL GiST: http://www.sai.msu.su/~megera/postgres/gist/
The above items are two of the most popular open source spatial indexing tools. Each provides a basic structure for running simple nearest-neighbor queries. Most likely you are going to want to use the C++ API by M. Hadjieleftheriou; it offers a rich set of features and is easily extensible. Once you have a working spatial indexing system, you should be able to solve simple queries such as:
“SELECT CLOSEST N NEIGHBORS FROM LOCATION X, Y, Z”
This exercise should help you become familiar with basic spatial indexing concepts such as R-Trees and nearest-neighbor queries. This portion of the project should be implemented using the above spatial indexing tools directly, without much effort (you do not need to implement R-Trees or similar structures yourself). This will be the basis from which you will build a cluster analysis tool for the final phase.
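As a point of reference for the sample query above, the following is a minimal brute-force sketch in Python. It does not use either of the listed APIs; the function name `k_nearest` and the sample data are hypothetical, and the linear scan is only meant as a baseline for checking the answers returned by your index-backed implementation.

```python
import heapq
import math


def k_nearest(points, query, k):
    """Return the k points closest to `query`, nearest first.

    Brute-force baseline (hypothetical helper): it scans every point,
    which is exactly the work a spatial index such as an R-Tree avoids.
    """
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


# Hypothetical 3-D sample data: "closest 2 neighbors from location X, Y, Z".
points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0), (0.5, 0.0, 0.0)]
nearest = k_nearest(points, (0.0, 0.0, 0.1), 2)
print(nearest)
```

Running the same query through the spatial index and through this scan on a small dataset is a simple way to confirm the two agree before moving to larger data.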
Now that you can locate the nearest neighbors of a specified location, it would also be helpful to be able to locate the densest area of the data. This has many applications, such as congestion control. The next portion of the project will be implemented on top of the spatial indexing tool created during the previous phase. First, there needs to be a method/function that finds the nearest neighbors of all of the points in the database in a single query. This is similar to the sample query from the first phase, but instead of solving it for a single point, you will need to solve it for all points simultaneously. If a user asks for all of the nearest neighbors of all of the data points, your application should return a list of every point in the database and its closest neighbors using the spatial indexing tool.
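The shape of the expected result can be illustrated with a small Python sketch. This is an assumption-laden baseline, not the required implementation: the function name `all_nearest_neighbors`, the parameter `k`, and the sample points are hypothetical, and the inner scan would be replaced by a nearest-neighbor query against your spatial index.

```python
import heapq
import math


def all_nearest_neighbors(points, k=1):
    """Map each point's index to the indices of its k nearest other points.

    Quadratic-time reference (hypothetical helper); in the project the
    inner search would be answered by the spatial index instead.
    """
    result = {}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        result[i] = heapq.nsmallest(
            k, others, key=lambda j: math.dist(p, points[j])
        )
    return result


# Hypothetical 2-D sample data.
points = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (10.0, 10.0)]
neighbors = all_nearest_neighbors(points, k=1)
print(neighbors)
```

Note that the relation is not symmetric: a point's nearest neighbor may itself be closer to some third point, so every point has to be queried.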
The next portion, and probably the most challenging part of the project, is to research and find a method for locating the densest area(s) of the dataset. More specifically, every point in the database is some distance from every other point. Your goal is to find the group of points with the smallest combined distance between them (a cluster) and to locate its center. This function should take no input and should output a single location specifying the center of the cluster.
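One possible heuristic, shown here only as a starting point for your own research, is to treat each point together with its nearest neighbors as a candidate cluster, score the candidate by the sum of its pairwise distances, and report the centroid of the best candidate. Everything in this Python sketch is an assumption: the fixed neighborhood size `K`, the function name `densest_cluster_center`, and the sample data are hypothetical, and the sketch assumes the dataset holds at least `K + 1` points.

```python
import heapq
import math

K = 3  # hypothetical fixed neighborhood size, so the function needs no input


def densest_cluster_center(points, k=K):
    """Return the centroid of the candidate group with the smallest
    combined pairwise distance (one heuristic among many)."""
    best_cost, best_group = math.inf, None
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        # Candidate cluster: the point plus its k nearest neighbors.
        group = [p] + heapq.nsmallest(k, others, key=lambda q: math.dist(p, q))
        # Each unordered pair is counted twice in the double loop, so halve it.
        cost = sum(math.dist(a, b) for a in group for b in group) / 2
        if cost < best_cost:
            best_cost, best_group = cost, group
    dims = len(best_group[0])
    return tuple(
        sum(pt[d] for pt in best_group) / len(best_group) for d in range(dims)
    )


# Hypothetical data: a tight unit square plus scattered outliers.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (20.0, 5.0), (30.0, 30.0), (-15.0, 8.0)]
center = densest_cluster_center(points)
print(center)
```

The neighbor search inside the loop is the same all-points query described earlier, so an index-backed version of that query can be reused here. The clustering methods in the referenced slides and textbook chapter are alternatives worth comparing against this heuristic.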
The binder should contain an all-inclusive set of documents covering everything for the project, including but not limited to the requirements document, instructions describing how to use the application, notes regarding caveats or issues, and any material used or referenced in completing the project. In addition, a CD image of the application and any software used must be included with the binder. An example project binder will be provided during one of the class sessions.
The presentation for each group is 10-15 minutes. The presentation should give a brief overview of the tools used and any issues that were addressed during the development process. The most important parts of the presentation are the live demo of the application and a description of how your group solved the spatial analysis problems presented. In addition, each member must describe and explain their contribution to the group.
· Late turn-ins are not accepted. Because the design document is the starting point for the project, late submissions of it are not allowed; however, groups whose documents need further review will not be penalized for the additional review.
· The project will be done by groups of two people.
· The documentation should make it clear how the work was divided in each group.
· All members of the group must be present for all checkpoint evaluations and presentations (evaluation dates will be posted on the mailing list).
· All students must be subscribed to the mailing list. In the event that a student misses an announcement due to lack of mailing list subscription he/she will receive no leniency.
· Group selections are final. As with “real-life” design projects you will experience in industry, projects are the responsibility of all of the developers. In the event that a student is not pulling their own weight, it is up to the group members to address and resolve the issues internally.
· Slides describing data clustering approaches are available here, here, and here. You can also read the chapter on clustering in the book "Data Mining: Concepts and Techniques", by J. Han and M. Kamber.