CS235 Fall 2001:
Data Mining Techniques
Dimitrios Gunopulos, dg@cs.ucr.edu
Office hours: TTh 3:30pm-5:00pm
The following books are optional; most of the topics will be covered
by papers and handouts.
- ``Data Mining: Concepts and Techniques,''
J. Han and M. Kamber, Morgan Kaufmann, San Francisco, 2000.
- ``Machine Learning,''
Tom Mitchell, McGraw Hill, 1997.
- ``Principles of Data Mining (Adaptive Computation and Machine Learning),''
H. Mannila, P. Smyth, D. Hand, MIT Press 2000.
Data mining has emerged as one of the most exciting fields in
Computer Science.
Today many organizations and commercial enterprises have large
online archives of data available, and these archives may contain
unknown, yet useful, information.
Data mining refers to a set of techniques that have been designed
to find interesting pieces of information or knowledge in large
amounts of data.
There is currently a large commercial interest in the area, both
for the development of data mining software and for the offering
of consulting services on data mining, with a market for the
former estimated at over 5 billion dollars.
In this course we explore how this interdisciplinary field brings
together techniques from databases, statistics and machine
learning.
We will discuss the main data mining methods currently used,
including clustering, classification, association rules mining,
time series clustering, and web mining.
Designing algorithms for these tasks is difficult because the
input data sets are very large, and the tasks may be very complex.
One of the main focuses in the field is to integrate these
algorithms with relational databases, and we examine the
additional complications that come up in this case.
Coursework: class participation, one project, a midterm, and a final exam.
- Overview
- Databases and OLAP (On Line Analytical Processing).
- Clustering: Hierarchical and partitional approaches.
- Classification: Decision Trees, Neural Nets, Naive Bayes classifiers.
- Association rules mining, sequential patterns.
- Time series similarity, time series clustering.
- Deviation detection.
- Advanced topics: Incremental mining, scalability, parallel data mining algorithms, web mining.
- Association Rules
- ``Mining Frequent Patterns without Candidate Generation'',
J. Han, J. Pei, and Y. Yin,
Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas,
TX, May 2000.
http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
- ``Efficiently Mining Long Patterns from Databases'',
R. J. Bayardo Jr.,
Proc. of the ACM SIGMOD Conference on Management of Data, Seattle,
Washington, 85-93, June 1998.
www.almaden.ibm.com/cs/quest/PUBS.html
- ``Data mining, Hypergraph Transversals, and Machine Learning'',
D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen,
16th ACM Symp. on Principles of Database Systems (PODS), Tucson, AZ, 1997.
www.almaden.ibm.com/cs/quest/PUBS.html
- ``Fast Algorithms for Mining Association Rules'',
R. Agrawal, R. Srikant,
Proc. of the 20th Int'l Conference on
Very Large Databases, Santiago, Chile, Sept. 1994.
www.almaden.ibm.com/cs/quest/PUBS.html
- Classification
- ``Models and Selection Criteria for Regression and Classification'',
D. Heckerman and C. Meek,
MSR-TR-97-08,
www.research.microsoft.com/research/dtg/hecherma/hecherma.html
- ``Boosting and Naive Bayesian Learning'', C. Elkan,
www-cse.ucsd.edu/users/elkan
- ``MDL-based Decision Tree Pruning'', M. Mehta, R. Agrawal, and
J. Rissanen, KDD 95,
www.almaden.ibm.com/cs/people/ragrawal/pubs.html
- ``RainForest - A Framework for Fast Decision Tree Construction
of Large Datasets'',
J. Gehrke, R. Ramakrishnan, V. Ganti, VLDB 98,
www.almaden.ibm.com/cs/people/ragrawal/pubs.html
- ``Adaptive Metric Nearest Neighbor Classification.''
C. Domeniconi, J. Peng, D. Gunopulos,
Computer Vision and Pattern Recognition 2000.
dblab.cs.ucr.edu/publications.html
- Clustering
- ``Cure: an efficient clustering algorithm for large databases'',
S. Guha, R. Rastogi, K. Shim, SIGMOD 98,
www.bell-labs.com/user/rastogi
- ``BIRCH: An Efficient Data Clustering Method for Very Large Databases'',
T. Zhang, R. Ramakrishnan, M. Livny, SIGMOD 96,
www.cs.wisc.edu/~raghu/raghu.html
- ``Efficient and Effective Clustering Method for Spatial Data Mining'',
R. Ng and J. Han,
Proc. of 1994 Int'l Conf. on Very Large Data Bases (VLDB'94),
Santiago, Chile, September 1994,
http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
- ``Automatic Subspace Clustering of High Dimensional
Data for Data Mining Applications,''
R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan,
in Proc. of 1998 SIGMOD,
Seattle, WA, 94-105.
dblab.cs.ucr.edu/publications.html
- Web indexing and classification
- ``Scalable Feature selection, classification and signature
generation for organizing large text databases into hierarchical
topic taxonomies'', S. Chakrabarti, B. Dom, R. Agrawal
and P. Raghavan, VLDB Journal 98,
www.almaden.ibm.com/cs/k53/ir.html
- ``Enhanced hypertext categorization using hyperlinks'',
S. Chakrabarti, B. Dom, P. Indyk, SIGMOD 1998,
www.almaden.ibm.com/cs/k53/ir.html
- ``Inferring Web Communities from Link Topologies'',
D. Gibson, J. Kleinberg, P. Raghavan, ACM Hypertext 1998,
www.almaden.ibm.com/cs/k53/ir.html
- ``Authoritative Sources in a hyperlinked environment'',
J. Kleinberg, SODA 1998,
www.almaden.ibm.com/cs/k53/ir.html
- ``Improved Algorithms for topic distillation in a hyperlinked environment'',
K. Bharat and M. Henzinger,
SIGIR 98,
www.research.digital.com/SRC/staff/bharat/bib.html
- ``Focused crawling: a new approach to topic-specific web resource discovery.''
S. Chakrabarti, M. van den Berg, and B. Dom.
Computer Networks, 31:1623-1640, 1999.
First appeared in the 8th International World Wide Web Conference, 1999.
http://www8.org/w8-papers/5a-search-query/crawling/index.html
- ``Athena: Mining-based Interactive Management of Text Databases,''
R. Agrawal, R. J. Bayardo Jr. and R. Srikant.
IBM Research Report RJ10153, July 1999. (To appear in Proc. of EDBT-2000.)
http://www.almaden.ibm.com/cs/quest/PUBS.html
- Sequence similarity and sequential patterns
- ``Mining Sequential Patterns'',
R. Srikant, R. Agrawal, Fifth Int'l Conference on Extending
Database Technology (EDBT), Avignon, France, March 1996.
www.almaden.ibm.com/cs/quest/publications.html
- ``Efficient Enumeration of Frequent Sequences,''
Mohammed Zaki, CIKM'98.
http://www.cs.rpi.edu/~zaki/papers.html
- ``PrefixSpan: Mining Sequential Patterns Efficiently by
Prefix-Projected Pattern Growth'',
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu,
Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
- ``Fast Time Sequence Indexing for Arbitrary Lp Norms,''
Byoung-Kee Yi and Christos Faloutsos,
VLDB 2000, Cairo, Egypt, Sept. 10-14, 2000
http://www.cs.cmu.edu/~christos/cpub.html
- ``Time-Series Similarity Problems and Well-Separated Geometric Sets,''
B. Bollobás, G. Das, D. Gunopulos and H. Mannila, in
the 13th ACM Symp. on Computational Geometry, 1997, Nice, France.
dblab.cs.ucr.edu/publications.html
- Sampling and dataset approximation
- ``On Near-Uniform URL Sampling'',
Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, Marc Najork,
9th International World Wide Web Conference, 2000.
http://www9.org/w9cdrom/88/88.html
- ``Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes,''
D. Gunopulos, G. Kollios, V. Tsotras and C. Domeniconi in
the 19th ACM SIGMOD Conf., Dallas, 2000,
dblab.cs.ucr.edu/publications.html
You will have to define and propose a project for this
class by the 4th week of classes.
The project definition should have sufficient detail
to give a clear idea of what you want to do, and how
you plan to do it (including implementation detail).
In addition you will have to find the right data to
test your final project.
You will be graded, with partial credit available,
on the following:
- The quality and detail of your proposal.
- Finding the right test datasets, and putting them
into an easy-to-use format.
- The implementation of your proposal.
- A final report (approximately the size
of a workshop paper) that explains what you have done,
and the experimental results that you obtained.
The deadline for the project will be the last week of classes.
You can work in teams of 2.
- Implement algorithms for finding maximal frequent sets.
- Comparison of subspace clustering techniques.
- Implement frequent sequential pattern algorithms
for mining bio-sequences.
- Cluster genes using gene expression data.
- Design and implementation of selectivity estimators for
range queries.
- Indexing time series using the Longest Common Subsequence
similarity measure.
- Analyze network intrusion data.
- Build a Bayesian classifier for web documents.
- Build a clustering algorithm for documents.
- Compare dimensionality reduction techniques for documents.
Dimitrios Gunopulos
2001-10-25