

CS235 Fall 2001:
Data Mining Techniques

Instructor

Dimitrios Gunopulos, dg@cs.ucr.edu
Office hours: TTh 3:30pm-5:00pm

Reading Material

The following books are optional; most of the topics will be covered by papers and handouts.
  1. ``Data Mining: Concepts and Techniques,'' J. Han and M. Kamber, Morgan Kaufmann, San Francisco, 2000.
  2. ``Machine Learning,'' Tom Mitchell, McGraw Hill, 1997.
  3. ``Principles of Data Mining (Adaptive Computation and Machine Learning),'' H. Mannila, P. Smyth, D. Hand, MIT Press 2000.

Course objective

Data mining has emerged as one of the most exciting fields in Computer Science. Many organizations and commercial enterprises now maintain large online archives of data, and these archives may contain unknown yet useful information. Data mining refers to a set of techniques designed to find interesting pieces of information or knowledge in large amounts of data. There is currently strong commercial interest in the area, both in developing data mining software and in offering consulting services on data mining, with the market for the former estimated at over 5 billion dollars. In this course we explore how this interdisciplinary field brings together techniques from databases, statistics, and machine learning. We will discuss the main data mining methods currently in use, including clustering, classification, association rule mining, time series clustering, and web mining. Designing algorithms for these tasks is difficult because the input data sets are very large and the tasks themselves may be very complex. One of the main focuses of the field is integrating these algorithms with relational databases, and we examine the additional complications that arise in that setting.

Grading Method

Class participation, one project, midterm, final exam.

Syllabus

  1. Overview
  2. Databases and OLAP (On Line Analytical Processing).
  3. Clustering: Hierarchical and partitional approaches.
  4. Classification: Decision Trees, Neural Nets, Naive Bayes classifiers.
  5. Association rules mining, sequential patterns.
  6. Time series similarity, time series clustering.
  7. Deviation detection.
  8. Advanced topics: Incremental mining, scalability, parallel data mining algorithms, web mining.

List of papers for CS235

  1. Association Rules
    1. ``Mining Frequent Patterns without Candidate Generation'', J. Han, J. Pei, and Y. Yin, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000. http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
    2. ``Efficiently Mining Long Patterns from Databases'', R. J. Bayardo Jr., Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, 85-93, June 1998. www.almaden.ibm.com/cs/quest/PUBS.html
    3. ``Data mining, Hypergraph Transversals, and Machine Learning'', D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen, 16th ACM Symp. on Principles of Database Systems (PODS), Tucson, AZ, 1997. www.almaden.ibm.com/cs/quest/PUBS.html
    4. ``Fast Algorithms for Mining Association Rules'', R. Agrawal, R. Srikant, Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994. www.almaden.ibm.com/cs/quest/PUBS.html
  2. Classification
    1. ``Models and selection criteria for regression and Classification'', D. Heckerman and C. Meek, MSR-TR-97-08, www.research.microsoft.com/research/dtg/hecherma/hecherma.html
    2. ``Boosting and Naive Bayesian Learning'', C. Elkan, www-cse.ucsd.edu/users/elkan
    3. ``MDL-based Decision Tree Pruning'', M. Mehta, R. Agrawal, and J. Rissanen, KDD 95, www.almaden.ibm.com/cs/people/ragrawal/pubs.html
    4. ``RainForest - A framework for fast decision tree construction of Large Datasets'', J. Gehrke, R. Ramakrishnan, V. Ganti, VLDB 98, www.almaden.ibm.com/cs/people/ragrawal/pubs.html
    5. ``Adaptive Metric Nearest Neighbor Classification.'' C. Domeniconi, J. Peng, D. Gunopulos, Computer Vision and Pattern Recognition 2000. dblab.cs.ucr.edu/publications.html
  3. Clustering
    1. ``CURE: an efficient clustering algorithm for large databases'', S. Guha, R. Rastogi, K. Shim, SIGMOD 98, www.bell-labs.com/user/rastogi
    2. ``BIRCH: An Efficient Data Clustering Method for Very Large Databases'', T. Zhang, R. Ramakrishnan, M. Livny, SIGMOD 96, www.cs.wisc.edu/~raghu/raghu.html
    3. ``Efficient and Effective Clustering Method for Spatial Data Mining'', R. Ng and J. Han, Proc. of 1994 Int'l Conf. on Very Large Data Bases (VLDB'94), Santiago, Chile, September 1994, http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
    4. ``Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,'' R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, in Proc. of 1998 SIGMOD, Seattle, WA, 94-105. dblab.cs.ucr.edu/publications.html
  4. Web indexing and classification
    1. ``Scalable Feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies'', S. Chakrabarti, B. Dom, R. Agrawal and P. Raghavan, VLDB Journal 98, www.almaden.ibm.com/cs/k53/ir.html
    2. ``Enhanced hypertext categorization using hyperlinks'', S. Chakrabarti, B. Dom, P. Indyk, SIGMOD 1998, www.almaden.ibm.com/cs/k53/ir.html
    3. ``Inferring Web Communities from Link Topologies'', D. Gibson, J. Kleinberg, P. Raghavan, ACM Hypertext 1998, www.almaden.ibm.com/cs/k53/ir.html
    4. ``Authoritative Sources in a hyperlinked environment'', J. Kleinberg, SODA 1998, www.almaden.ibm.com/cs/k53/ir.html
    5. ``Improved Algorithms for topic distillation in a hyperlinked environment'', K. Bharat and M. Henzinger, SIGIR 98, www.research.digital.com/SRC/staff/bharat/bib.html
    6. ``Focused crawling: a new approach to topic-specific web resource discovery.'' S. Chakrabarti, M. van den Berg, and B. Dom. Computer Networks, 31:1623-1640, 1999. First appeared in the 8th International World Wide Web Conference, 1999. http://www8.org/w8-papers/5a-search-query/crawling/index.html
    7. ``Athena: Mining-based Interactive Management of Text Databases,'' R. Agrawal, R. J. Bayardo Jr. and R. Srikant. IBM Research Report RJ10153, July 1999. (To appear in Proc. of EDBT-2000.) http://www.almaden.ibm.com/cs/quest/PUBS.html
  5. Sequence similarity and sequential patterns
    1. ``Mining Sequential Patterns'', R. Srikant, R. Agrawal, Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996. www.almaden.ibm.com/cs/quest/publications.html
    2. ``Efficient Enumeration of Frequent Sequences,'' Mohammed Zaki, CIKM'98. http://www.cs.rpi.edu/~zaki/papers.html
    3. ``PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth'', J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001. http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
    4. ``Fast Time Sequence Indexing for Arbitrary Lp Norms,'' Byoung-Kee Yi and Christos Faloutsos, VLDB 2000, Cairo, Egypt, Sept. 10-14, 2000. http://www.cs.cmu.edu/~christos/cpub.html
    5. ``Time-Series Similarity Problems and Well-Separated Geometric Sets,'' B. Bollobás, G. Das, D. Gunopulos and H. Mannila, in the 13th ACM Symp. on Computational Geometry, 1997, Nice, France. dblab.cs.ucr.edu/publications.html
  6. Sampling and dataset approximation
    1. ``On Near-Uniform URL Sampling'', Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, Marc Najork, 9th International World Wide Web Conference, 2000. http://www9.org/w9cdrom/88/88.html
    2. ``Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes,'' D. Gunopulos, G. Kollios, V. Tsotras and C. Domeniconi, in the 19th ACM SIGMOD Conf., Dallas, 2000, dblab.cs.ucr.edu/publications.html

Projects for CS235

You will have to define and propose a project for this class by the 4th week of classes. The project definition should be detailed enough to give a clear idea of what you want to do and how you plan to do it (including implementation details). In addition, you will have to find suitable data to test your final project. You will be graded, and can earn partial credit, on the following:
  1. The quality and detail of your proposal.
  2. Finding the right test datasets, and putting them into an easy-to-use format.
  3. The implementation of your proposal.
  4. A final report (approximately the size of a workshop paper) that explains what you have done, and the experimental results that you obtained.
The deadline for the project will be the last week of classes. You can work in teams of 2.

List of possible project ideas for CS235

  1. Implement algorithms for finding maximal frequent sets.
  2. Comparison of subspace clustering techniques.
  3. Implement frequent sequential pattern algorithms for mining bio-sequences.
  4. Cluster genes using gene expression data.
  5. Design and implementation of selectivity estimators for range queries.
  6. Indexing time series using the Longest Common Subsequence similarity measure.
  7. Analyze network intrusion data.
  8. Build a Bayesian classifier for web documents.
  9. Build a clustering algorithm for documents.
  10. Compare dimensionality reduction techniques for documents.

Dimitrios Gunopulos 2001-10-25