Project Title: Data Mining Techniques
for Geospatial Applications
P.I: Dimitrios Gunopulos
Department
of Computer Science and Engineering, University of California,
Riverside
Contact
Information:
Dimitrios Gunopulos
Associate Professor
Department of Computer
Science and Engineering
University of California, Riverside,
CA 92521
Phone: (909) 787-2479, Fax : (909) 787-4643
Email: dg@cs.ucr.edu, http://www.cs.ucr.edu/~dg
WWW Page: http://www.cs.ucr.edu/~dg/discovery.html
List of Supported
Students:
Carlotta Domeniconi, currently at
George Mason Univiversity. Ilias
Tsoukatos, currently at Adobe.
Michail Vlachos, Dimitris
Papadopoulos, Sharmila Subramanian, Department of
Computer Science and Engineering, University of California, Riverside.
Project Award
Information:
Award Number: IIS-9984729
Duration: Oct 1, 2000 - Aug.
31, 2004
Keywords:
Query approximation (range
queries on multidimensional data, set indexing), biased sampling, mining
spatiotemporal patterns, similarity of trajectories, locally adaptive
classification, aggregation and classification of stream datasets
Project
Summary:
The goal of this research is to develop fundamental techniques to allow
efficient and interactive knowledge discovery from large multidimensional
datasets with spatial and temporal attributes. There are two general research
goals. The first involves the investigation of effective density approximation
techniques to approximate very large geospatial datasets. The proposed
techniques are used both to facilitate simple exploratory data mining tasks on
large geospatial datasets, and to efficiently provide accurate approximate
solutions to general data mining tasks, such as clustering, classification and
outlier detection. The second involves designing and implementing new algorithms
and techniques for similarity queries in large datasets. The educational
component of the project aims to develop courses that emphasize the fundamental
ideas in data mining, and introduce students to real data mining problems. The
results of this project will have a significant impact on how large
multidimensional datasets are analyzed, with applications in the fields of
Geographic Information Systems, Epidemiology, and Environmental
research.
Publications and Products:
Journal/Conference
publications:
[1] C. Domeniconi, D. Gunopulos, "Efficient Local Flexible Nearest
Neighbor Classification", SIAM Int. Conf. on Data Mining (SDM)
2002.
[2] M.
Vlachos, G. Kollios, D. Gunopulos, "Discovering similar trajectories", ICDE
2002.
[3] D.
Zhang, D. Gunopulos, V. Tsotras, B. Seeger, "Temporal Aggregation over Data
Streams using Multiple Granularities", EDBT 2002.
[4] C. Domeniconi, D. Gunopulos,
"Incremental Support Vector Machine Construction", IEEE Int. Conf on Data Mining
(ICDM) 2001.
[5] C. Domeniconi, D. Gunopulos, "Adaptive Nearest Neighbor
Classification using Support Vector Machines", NIPS 2001.
[6] C. Domeniconi, D.
Gunopulos, "An Efficient Approach for Approximating Multi-dimensional
Range Queries and Nearest Neighbor Classification in Large Datasets", ICML
2001.
[7] E.
Tsoukatos, D. Gunopulos, "Efficient Mining of SpatioTemporal Patterns",
Int. Symp. on Spatial and Temporal Databases (SSTD) 2001.
[8] G. Kollios, D.
Gunopulos, N. Koudas and S. Berchtold, "`Efficient Biased Sampling for
Approximate Clustering and Outlier Detection in Large Datasets", accepted in
IEEE Transactions on Knowledge and Data Engineering (TKDE).
[9] C. Domeniconi, J. Peng,
D. Gunopulos, "`Locally Adaptive Metric Nearest Neighbor Classification",
accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
[10]
B. Bollobas, G. Das, D. Gunopulos, H. Mannila, "Time-Series Similarity Problems
and Well-Separated Geometric Sets", Nordic Journal of Computing,
4/2001.
[11]
A. Gionis, D. Gunopulos, N. Koudas: "Efficient and Tunable Similar Set
Retrieval", SIGMOD 2001.
[12] D. Gunopulos, G. Das,
"Time Series Similarity Measures", Tutorial in SIGMOD 2001.
[13] D. Zhang, A.
Markowetz, V. Tsotras, D. Gunopulos, B. Seeger: "Efficient Computation of
Temporal Aggregates with Range Predicates," in PODS
2001.
[14] G.
Kollios, D. Gunopulos, N. Koudas, S. Berchtold: "An efficient
Approximation Scheme for Data Mining Tasks", ICDE 2001.
[15] V. Kalogeraki,
D. Gunopulos: "Managing Multimedia Streams in Distributed Environments
using CORBA", MIS 2000.
[16] C. Domeniconi, J. Peng, D.
Gunopulos: "An adaptive Metric Machine for Pattern Classification",
NIPS 2000.
[17] D. Gunopulos, G. Kollios, V. Tsotras, C. Domeniconi: "Using kernels
to approximate multi-dimensional aggregate range queries over real attributes."
Workshop on New Perspectives in Kernel-Based Learning Methods, NIPS
2000.
[18] P.
Chou, E. Grossman, D. Gunopulos, P. Kamesan: "Identifying Perspective
Customers," in SIGKDD 2000.
[19] C. Domeniconi, J. Peng and D.
Gunopulos: "Adaptive Metric Nearest Neighbor Classification", in Computer
Vision & Pattern Recognition 2000.
[20] Dimitrios Gunopulos, George
Kollios, Vassilis J. Tsotras, Carlotta Domeniconi: "Approximating
Multi-Dimensional Aggregate Range Queries over real Attributes", 2000 ACM
SIGMOD.
[21]M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, E. Keogh, "Indexing Multi-Dimensional Time-Series with Support for Multiple Distance Measures", Proc. of 9th SIGKDD, 2003.
[22] M. Cardle, M. Vlachos, S. Brooks, E. Keogh, D. Gunopulos, "Fast Motion Capture Matching with Replicated Motion Editing", Proc. of SIGGRAPH 2003, San Diego, Technical Sketches & Applications
[23] M. Vlachos, J. Lin, E. Keogh, D. Gunopulos, "A Wavelet-Based Anytime Algorithm for K-Means Clustering of Time Series", Proc. of 3rd SDM 2003, Workshop on Clustering High Dimensionality Data and Its Applications.
[24] D. Papadopoulos, C. Domeniconi, D. Gunopulos, Sheng Ma, "Clustering Gene Expression Data in SQL Using Locally Adaptive Metrics", 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
[25] C. Domeniconi, D. Gunopulos, "An Efficient Density-Based Approach for Data Mining Tasks", Knowledge and Information Systems, 2003, accepted.
[26] Huiwen Wu, Dimitrios Gunopulos, "Evaluating the Utility of Statistical Phrases and Latent Semantic Indexing for Text Classification", IEEE Int. Conf. on Data Mining, 2002.
[27] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, R. Sharma, "Discovering all most specific sequences", ACM TODS, accepted.
[28] D. Gunopulos, G. Kollios, V. Tsotras, C. Domeniconi, "Selectivity Estimators for Multi-Dimensional Range Queries over Real Attributes", VLDB Journal, Accepted.
[29] J. Lin, D. Gunopulos, "Dimensionality Reduction by Random Projection and Latent Semantic Indexing", Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, (2003).
Ph.D, and Master
Thesis, Books:
[1] Ilias
Tsoukatos: "Mining Spatiotemporal Data", M.S. Thesis, Univ. Of California, Riverside,
February 2001.
[2] Michail Vlachos: "Discovering Similar Trajectories", M.S. Thesis, Univ. Of California, Riverside, 2001.
[3] Carlotta Domeniconi: "Locally Adaptive Techniques for Pattern Classifcation", Ph.D. Thesis, UCR, 8/2002.
[4] Michalis Vazirgiannis, Maria Halkidi, Dimitrios Gunopulos, "Quality Assessment and Uncertainty Handling in Data Mining" Springer-Verlag LNAI Series, 2003.
Project
Impact:
Carlotta Domeniconi graduated with her Ph.D. degree in Summer 2002. She
is now an Assistant Professor at George Masosn University. Her research involves
the design of techniques for density approximation of multidimensional datasets,
and the theoretical and experimental evaluation of such techniques. Michail
Vlachos got a MS degree in 2001 and is currently working on his Ph.D. research.
For his MS thesis he desinged and implemented a technique for efficiently
discovering similar object trajectories. Ilias Tsoukatos graduated with a MS
degree during 2001. For his thesis he designed and implemented a system for
mining spatiotemporal datasets, such as ebvironmental data, and finding frequent
patterns in such datasets. Dimitrios Papadopoulos is expected to earn his MS.
degree in Summer 2003.
Parts of this research are used in a
Data Mining graduate course (CS235) taught at the University of California,
Riverside (Fall 2000). This is the first time this particular course is taught
in the Department.
We have also organized, in collaboration with Dr. Koudas from AT&T
Labs, a 5 day tutorial session that was presented under the Center for Discrete
Mathematics and Theoretical Computer Sciene (DIMACS) 2001-2004 Special Focus on
Data Analysis and Mining, and sponsored by NSF. The tutorial included 16 invited
speakers from industry and academia.
For this project we have established
a cooperation with AT&T Labs. Through this cooperation we expect to
have access to real datasets and applications. We also cooperate with the
Department of Defense.
Outreach Activities:
The project findings are summarized in a dedicated web page:http://www.cs.ucr.edu/~dg/discovery.html
The major findings are published in journals and conferences. We gave two
tutorials in SIGKDD 2000 and SIGMOD 2000 on time sequence similarity measures,
and a tutorial in SSTD 2001 on mining spatiotemporal data.
Goals, Objectives, and
Targeted Activities:
This proposal researches
techniques that support efficient data mining operations on geospatial data. In
addition it proposes new data mining tasks that can be applied to such data. The
problems we are addressing are the following:
(i) Density estimators to approximate geospatial data sets: We investigate two new techniques for approximating the density of geospatial data sets. These are multi-dimensional histograms with overlapping buckets of variable size and multi-dimensional kernel estimators with adaptive bandwidths [20, 17, 6, 28]. Advanced density estimator techniques allow efficient approximation of range queries on geospatial data, and enable interactive exploratory data mining.
(ii) Adaptive sampling for data mining algorithms: We will investigate the use of adaptive sampling for batched data mining tasks for geospatial datasets. In adaptive sampling, dense regions of the space are sampled at a different frequency than sparse regions. Adaptive sampling requires an efficient density estimation technique [4, 8]. To address this issue we plan to use the new density estimation techniques we propose above.
(iii) Similarity models for generalized time series: Spatiotemporal data describe objects that change over time. We propose new similarity models to identify objects that change their location in similar patterns [12, 10, 2], and identify regions that show similar patterns in the change of their attributes [7]. These techniques can be used for clustering objects or for answering nearest neighbor queries.
(iv) Locally adaptive nearest neighbor classification: In [19, 16, 9, 5, 1, 25, Domeniconi Ph.D. Thesis] we show how we can use local information for improving the performance of nearest neighbor classification techniques on numerical data.
(v) Set Indexing: In [11] we give a technique that solves approximately the set indexing problem: Given a query set q and a collection of sets S, find the sets in S that are most similar to q.
(vi) Stream Computing: In [3, 4] we initiate a study on how to efficiently do range query approximation and how to induce SVM classifiers on stream data.
(vii) Applications of similarity models for trajectories: In [21] we give new indexing techniques for finding similar trajectories. We apply such techniques for matching body movements in [22]. In [24] we give algorithms for clustering gene expression time sequences.
(viii) Text indexing: In [26] we discuss the effects of using phrases in combination to dimensionality reduction techniques. In [29] we compare the effectiveness of different dimensionality reduction techniques for text retrieval.
(ix) Association rules mining: In [27] we give a theoretical foundtion on the complexity of the problem of finding all maximal frequent sets.
Area Background:
This project researches basic techniques that support efficient data
mining operations in geospatial data, either in a static environment or in a
time evolving environment (also termed a spatiotemporal environment). Typically,
geospatial data contain geometric or topological information, in addition to
other kinds of attributes, such as environmental factors or data related to
natural resource exploration. Examples of time evolving geospatial data include
data that describe the evolution of natural phenomena, earth science data that
describe spatiotemporal phenomena in the atmosphere, two- or
three-dimensional sequences of images of geographical regions, or data that
describe the location of individuals in the geographic space as a function of
time. We consider two possible modes of user interaction. In the first,
exploratory data mining, the user takes an exploratory part in the discovery
process by (i) defining and testing hypotheses, (ii) asking queries on the data
and (iii) striving to understand the structure of the data. Examples of
such tasks are answering range queries, or finding the nearest neighbor (or
nearest neighbors) to a given object. In the second, batched data mining the
user defines the goal that the data mining technique has to achieve. Such data
mining tasks include clustering, classification, outlier detection, feature
similarity, spatial association rules, trend detection. Speeding up the
execution of such tasks, and scaling the algorithms to run for large data sets,
is one of the most important problems in data mining research.