Project Title: Data Mining Techniques
for Geospatial Applications
P.I: Dimitrios Gunopulos
Department
of Computer Science and Engineering, University of California,
Riverside
Contact
Information:
Dimitrios Gunopulos
Associate Professor
Department of Computer
Science and Engineering
University of California, Riverside,
CA 92521
Phone: (909) 787-2479, Fax : (909) 787-4643
Email: dg@cs.ucr.edu, http://www.cs.ucr.edu/~dg
WWW Page: http://www.cs.ucr.edu/~dg/discovery.html
List of Supported
Students:
Project Award
Information:
Keywords:
Project
Summary:
Publications and
Products:
Journal/Conference
publications: [21]M. Vlachos,
M. Hadjieleftheriou, D. Gunopulos, E. Keogh, "Indexing Multi-Dimensional
Time-Series with Support for Multiple Distance Measures", Proc. of 9th
SIGKDD, 2003. [22]
M. Cardle, M. Vlachos, S. Brooks, E. Keogh, D. Gunopulos, "Fast Motion
Capture Matching with Replicated Motion Editing", Proc. of SIGGRAPH
2003, San Diego, Technical Sketches & Applications [23]
M. Vlachos, J. Lin, E. Keogh, D. Gunopulos, "A Wavelet-Based Anytime
Algorithm for K-Means Clustering of Time Series", Proc. of 3rd SDM 2003,
Workshop on Clustering High Dimensionality Data and Its Applications. [24]
D. Papadopoulos, C. Domeniconi, D. Gunopulos, Sheng Ma, "Clustering Gene
Expression Data in SQL Using Locally Adaptive Metrics", 8th ACM SIGMOD
Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003. [25]
C. Domeniconi, D. Gunopulos, "An Efficient Density-Based Approach for Data
Mining Tasks", Knowledge and Information Systems, 2003, accepted. [26]
Huiwen Wu, Dimitrios Gunopulos, "Evaluating the Utility of Statistical
Phrases and Latent Semantic Indexing for Text Classification", IEEE Int.
Conf. on Data Mining, 2002. [27]
D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, R. Sharma,
"Discovering all most specific sequences", ACM TODS, accepted. [28]
D. Gunopulos, G. Kollios, V. Tsotras, C. Domeniconi, "Selectivity
Estimators for Multi-Dimensional Range Queries over Real Attributes", VLDB
Journal, Accepted. [29]
J. Lin, D. Gunopulos, "Dimensionality Reduction by Random Projection and
Latent Semantic Indexing", Text Mining Workshop, at the 3rd SIAM
International Conference on Data Mining, (2003).
Ph.D, and Master
Thesis, Books: [3]
Carlotta Domeniconi: "Locally Adaptive Techniques for Pattern Classifcation",
Ph.D. Thesis, UCR, 8/2002.
[4]
Michalis Vazirgiannis, Maria Halkidi, Dimitrios Gunopulos, "Quality
Assessment and Uncertainty Handling in Data Mining" Springer-Verlag LNAI
Series, 2003.
[5]
Michail Vlachos, "Similarity and Indexing in Multidimensional
Spaces"
Ph.D. Thesis, UCR, 8/2004
Project
Impact: Outreach Activities: Goals, Objectives, and
Targeted Activities: (i) Density estimators to
approximate geospatial data sets: We investigate two new techniques for
approximating the density of geospatial data sets. These are multi-dimensional
histograms with overlapping buckets of variable size and multi-dimensional
kernel estimators with adaptive bandwidths [20, 17, 6, 28]. Advanced density estimator techniques allow efficient
approximation of range queries on geospatial data, and enable interactive
exploratory data mining. (ii) Adaptive sampling for data mining algorithms: We
will investigate the use of adaptive sampling for batched data mining tasks for
geospatial datasets. In adaptive sampling, dense regions of the space are
sampled at a different frequency than sparse regions. Adaptive sampling requires
an efficient density estimation technique [4, 8]. To address
this issue we plan to use the new density estimation techniques we propose
above. (iii) Similarity models for generalized time series: Spatiotemporal data
describe objects that change over time. We propose new similarity models to
identify objects that change their location in similar patterns [12, 10, 2], and identify regions that show similar
patterns in the change of their attributes [7]. These techniques can be
used for clustering objects or for answering nearest neighbor queries. (iv)
Locally adaptive nearest neighbor classification: In [19, 16, 9, 5, 1, 25, Domeniconi Ph.D.
Thesis] we
show how we can use local information for improving the performance of nearest
neighbor classification techniques on numerical data. (v) Set Indexing: In
[11] we give a technique that solves approximately the set indexing problem:
Given a query set q and a collection of sets S, find the sets in S that are most
similar to q. (vi) Stream Computing: In
[3, 4] we initiate a
study on how to efficiently do range query approximation and how to induce SVM
classifiers on stream data. (vii) Applications of similarity models for
trajectories: In [21] we give new indexing techniques for finding
similar trajectories. We apply such techniques for matching body movements in
[22]. In [24] we give algorithms for clustering gene expression time
sequences. (viii) Text indexing: In
[26] we discuss the effects of using
phrases in combination to dimensionality reduction techniques. In [29] we
compare the effectiveness of different dimensionality reduction techniques for
text retrieval. (ix) Association rules mining: In
[27] we give a
theoretical foundtion on the complexity of the problem of finding all maximal
frequent sets.
Area Background:
We consider two possible
modes of user interaction. In the first, exploratory data mining, the user
takes an exploratory part in the discovery process by (i) defining and
testing hypotheses, (ii) asking queries on the data and (iii) striving
to understand the structure of the data. The simplest example of such a
task is answering range queries. Here the user defines a specific region
he wants to explore, and asks queries to find the characteristics of this
region, for example the number of points in the interior of the region,
or the average value of different attributes in this regions. Another example
of an exploratory data mining task is finding the nearest neighbor (or
nearest neighbors) to a given object.
In the second, batched
data mining the user defines the goal that the data mining technique has
to achieve. Such data mining tasks include clustering, classification,
outlier detection, feature similarity, spatial association rules, trend
detection. Speeding up the execution of such tasks, and scaling the algorithms
to run for large data sets, is one of the most important problems in data
mining research. Some commercial data mining software (for example SAS
Enterprise Miner) employ uniform sampling to allow the user to work interactively
using batched data mining tasks, such as clustering or classification.
To execute faster, the algorithms are run on the (much smaller than the
actual dataset) sample.
It is important to
discover efficient techniques that are suitable for geospatial data because
not only the datasets are large, but also the data mining tasks we wish
to perform are quite complex. For exploratory data mining tasks in particular,
very fast techniques must be provided if the task is to be feasible at
all. To find techniques that are efficient when applied to geospatial data,
the specific characteristics of such data have to be properly exploited.
In
this project we explore techniques for approximating datasets as well as
new approximate or online algorithms.
Area References:
Potential Related
Projects:
Carlotta Domeniconi, Ph.D. 2002, currently Assistant
Professor at
George Mason University.
Michail Vlachos, Ph.D. 2004, currently at IBM T.J. Watson Research Center.
Dimitris Papadopoulos, Ph.D. expected Fall 2004.
M.S. Thesis: Ilias
Tsoukatos, Michail Vlachos,
Sharmila Subramanian.
Award Number: IIS-9984729
Duration: Oct 1, 2000 - Aug.
31, 2004
Query approximation (range
queries on multidimensional data, set indexing), biased sampling, mining
spatiotemporal patterns, similarity of trajectories, locally adaptive
classification, aggregation and classification of stream datasets
The goal of this research is to develop fundamental techniques to allow
efficient and interactive knowledge discovery from large multidimensional
datasets with spatial and temporal attributes. There are two general research
goals. The first involves the investigation of effective density approximation
techniques to approximate very large geospatial datasets. The proposed
techniques are used both to facilitate simple exploratory data mining tasks on
large geospatial datasets, and to efficiently provide accurate approximate
solutions to general data mining tasks, such as clustering, classification and
outlier detection. The second involves designing and implementing new algorithms
and techniques for similarity queries in large datasets. The educational
component of the project aims to develop courses that emphasize the fundamental
ideas in data mining, and introduce students to real data mining problems. The
results of this project will have a significant impact on how large
multidimensional datasets are analyzed, with applications in the fields of
Geographic Information Systems, Epidemiology, and Environmental
research.
[1] C. Domeniconi, D. Gunopulos, "Efficient Local Flexible Nearest
Neighbor Classification", SIAM Int. Conf. on Data Mining (SDM)
2002.
[2] M.
Vlachos, G. Kollios, D. Gunopulos, "Discovering similar trajectories", ICDE
2002.
[3] D.
Zhang, D. Gunopulos, V. Tsotras, B. Seeger, "Temporal Aggregation over Data
Streams using Multiple Granularities", EDBT 2002.
[4] C. Domeniconi, D. Gunopulos,
"Incremental Support Vector Machine Construction", IEEE Int. Conf on Data Mining
(ICDM) 2001.
[5] C. Domeniconi, D. Gunopulos, "Adaptive Nearest Neighbor
Classification using Support Vector Machines", NIPS 2001.
[6] C. Domeniconi, D.
Gunopulos, "An Efficient Approach for Approximating Multi-dimensional
Range Queries and Nearest Neighbor Classification in Large Datasets", ICML
2001.
[7] E.
Tsoukatos, D. Gunopulos, "Efficient Mining of SpatioTemporal Patterns",
Int. Symp. on Spatial and Temporal Databases (SSTD) 2001.
[8] G. Kollios, D.
Gunopulos, N. Koudas and S. Berchtold, "`Efficient Biased Sampling for
Approximate Clustering and Outlier Detection in Large Datasets", accepted in
IEEE Transactions on Knowledge and Data Engineering (TKDE).
[9] C. Domeniconi, J. Peng,
D. Gunopulos, "`Locally Adaptive Metric Nearest Neighbor Classification",
accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
[10]
B. Bollobas, G. Das, D. Gunopulos, H. Mannila, "Time-Series Similarity Problems
and Well-Separated Geometric Sets", Nordic Journal of Computing,
4/2001.
[11]
A. Gionis, D. Gunopulos, N. Koudas: "Efficient and Tunable Similar Set
Retrieval", SIGMOD 2001.
[12] D. Gunopulos, G. Das,
"Time Series Similarity Measures", Tutorial in SIGMOD 2001.
[13] D. Zhang, A.
Markowetz, V. Tsotras, D. Gunopulos, B. Seeger: "Efficient Computation of
Temporal Aggregates with Range Predicates," in PODS
2001.
[14] G.
Kollios, D. Gunopulos, N. Koudas, S. Berchtold: "An efficient
Approximation Scheme for Data Mining Tasks", ICDE 2001.
[15] V. Kalogeraki,
D. Gunopulos: "Managing Multimedia Streams in Distributed Environments
using CORBA", MIS 2000.
[16] C. Domeniconi, J. Peng, D.
Gunopulos: "An adaptive Metric Machine for Pattern Classification",
NIPS 2000.
[17] D. Gunopulos, G. Kollios, V. Tsotras, C. Domeniconi: "Using kernels
to approximate multi-dimensional aggregate range queries over real attributes."
Workshop on New Perspectives in Kernel-Based Learning Methods, NIPS
2000.
[18] P.
Chou, E. Grossman, D. Gunopulos, P. Kamesan: "Identifying Perspective
Customers," in SIGKDD 2000.
[19] C. Domeniconi, J. Peng and D.
Gunopulos: "Adaptive Metric Nearest Neighbor Classification", in Computer
Vision & Pattern Recognition 2000.
[20] Dimitrios Gunopulos, George
Kollios, Vassilis J. Tsotras, Carlotta Domeniconi: "Approximating
Multi-Dimensional Aggregate Range Queries over real Attributes", 2000 ACM
SIGMOD.
[1] Ilias
Tsoukatos: "Mining Spatiotemporal Data", M.S. Thesis, Univ. Of California, Riverside,
February 2001.
[2] Michail Vlachos: "Discovering Similar Trajectories", M.S. Thesis, Univ. Of California, Riverside, 2001.
Carlotta Domeniconi graduated with her Ph.D. degree in Summer 2002. She
is now an Assistant Professor at George Masosn University. Her research involves
the design of techniques for density approximation of multidimensional datasets,
and the theoretical and experimental evaluation of such techniques. Michail
Vlachos got a MS degree in 2001 and graduated with a Ph.D. degree
in Summer 2004.
He is now working at IBM T.J. Watson Research Center.
His research is focused on improving indexing and similarity search
in high-dimensional spaces.
Ilias Tsoukatos graduated with a MS
degree during 2001. For his thesis he designed and implemented a system for
mining spatiotemporal datasets, such as ebvironmental data, and finding frequent
patterns in such datasets. Dimitrios Papadopoulos is expected to earn his Ph.D.
degree in Fall 2004.
Parts of this research are used in a
Data Mining graduate course (CS235) taught at the University of California,
Riverside (Fall 2000). This is the first time this particular course is taught
in the Department.
We have also organized, in collaboration with Dr. Koudas from AT&T
Labs, a 5 day tutorial session that was presented under the Center for Discrete
Mathematics and Theoretical Computer Sciene (DIMACS) 2001-2004 Special Focus on
Data Analysis and Mining, and sponsored by NSF. The tutorial included 16 invited
speakers from industry and academia.
For this project we have established
a cooperation with AT&T Labs. Through this cooperation we expect to
have access to real datasets and applications. We also cooperate with the
Department of Defense.
The project findings are summarized in a dedicated web page:http://www.cs.ucr.edu/~dg/discovery.html
The major findings are published in journals and conferences. We gave two
tutorials in SIGKDD 2000 and SIGMOD 2000 on time sequence similarity measures,
and a tutorial in SSTD 2001 on mining spatiotemporal data.
This proposal researches
techniques that support efficient data mining operations on geospatial data. In
addition it proposes new data mining tasks that can be applied to such data. The
problems we are addressing are the following:
This project researches
basic techniques that support efficient data mining operations in geospatial
data, either in a static environment or in a time evolving environment
(also termed a spatiotemporal environment).
Typically, geospatial
data contain geometric or topological information, in addition to other
kinds of attributes, such as environmental factors or data related to natural
resource exploration. Examples of time evolving geospatial data include
data that describe the evolution of natural phenomena, earth science data
that describe spatiotemporal phenomena in the atmosphere, two- or
three-dimensional sequences of images of geographical regions, or data
that describe the location of individuals in the geographic space as a
function of time. Observation systems, such as radar or satellites, can
generate very large datasets. For example, the NASA Earth Observation System
is expected to transmit 50GB of data per hour. In this proposal we generally
assume that geospatial data describe objects with spatial attributes that
define the objects' location in space, and non-spatial attributes that
define the objects' properties.
[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and
A. I. Verkamo, "Fast Discovery of Association Rules", Advances in
Knowledge Discovery and Data Mining, Chapter 12, AAAI/MIT Press, 1995.
[2] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uhturusamy,
editors. Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press,
1996.
[3] D. Gunopulos, R. Khardon, H. Mannila and H.
Toivonen: "Data mining, Hypergraph Transversals, and Machine Learning",
in the 16th ACM Symp. on Principles of Database Systems (PODS-1997).
[4] Jiawei Han and Micheline Kamber. Data Mining: Concepts
and Techniques, Morgan Kaufmann Publishers, 2000.
[5] A.K. Jain and R.C. Dubes. Algorithms for Clustering
Data. Prentice Hall, 1988.
[6] D. Scott. Multivariate Density Estimation. Theory,
Practice and Visualization. Wiley & Sons, 1992.
[7] Sholom M. Weiss and Casimir A. Kulikowski. Computer
Systems That Learn: Classification and Prediction Methods from Statistics,
Neural Nets, Machine Learning and Expert Systems. Morgan Kauffmann Publishers.
1990.
We would like to cooperate
with colleagues working on knowledge discovery in databases.