National Science Foundation

Project Title: Data Mining Techniques for Geospatial Applications

P.I: Dimitrios Gunopulos
Department of Computer Science and Engineering, University of California, Riverside

 

Contact Information:
Dimitrios Gunopulos
Associate  Professor
Department of Computer Science and Engineering
University of California, Riverside, CA 92521
Phone: (909) 787-2479, Fax : (909) 787-4643
Email:  dg@cs.ucr.edu, http://www.cs.ucr.edu/~dg

 

WWW Page: http://www.cs.ucr.edu/~dg/discovery.html

 

List of Supported Students:
Carlotta Domeniconi, currently at George Mason Univiversity.  Ilias Tsoukatos, currently at  Adobe.
Michail Vlachos, Dimitris Papadopoulos, Sharmila Subramanian, Department of Computer Science and Engineering, University of California, Riverside.

 

Project Award Information:
Award Number:  IIS-9984729
Duration:  Oct 1, 2000 - Aug. 31, 2004

 

Keywords:
Query approximation (range queries on multidimensional data, set indexing), biased sampling, mining spatiotemporal patterns, similarity of  trajectories, locally adaptive classification, aggregation and classification of stream datasets

 

Project Summary:
The goal of this research is to develop fundamental techniques to allow efficient and interactive knowledge discovery from large multidimensional datasets with spatial and temporal attributes. There are two general research goals. The first involves the investigation of effective density approximation techniques to approximate very large geospatial datasets. The proposed techniques are used both to facilitate simple exploratory data mining tasks on large geospatial datasets, and to efficiently provide accurate approximate solutions to general data mining tasks, such as clustering, classification and outlier detection. The second involves designing and implementing new algorithms and techniques for similarity queries in large datasets. The educational component of the project aims to develop courses that emphasize the fundamental ideas in data mining, and introduce students to real data mining problems. The results of this project will have a significant impact on how large multidimensional datasets are analyzed, with  applications in the fields of Geographic Information Systems, Epidemiology, and Environmental research.

 

Publications and Products:

Journal/Conference publications:
 [1] C. Domeniconi, D. Gunopulos, "Efficient Local Flexible Nearest Neighbor Classification", SIAM Int. Conf. on Data Mining (SDM) 2002.
 [2] M. Vlachos, G. Kollios, D. Gunopulos, "Discovering similar trajectories", ICDE 2002.
 [3] D. Zhang, D. Gunopulos, V. Tsotras, B. Seeger, "Temporal Aggregation over Data Streams using Multiple Granularities", EDBT 2002.
 [4] C. Domeniconi, D. Gunopulos, "Incremental Support Vector Machine Construction", IEEE Int. Conf on Data Mining (ICDM) 2001.
 [5] C. Domeniconi, D. Gunopulos, "Adaptive Nearest Neighbor Classification using Support Vector Machines", NIPS 2001.
 [6] C. Domeniconi, D. Gunopulos,  "An Efficient Approach for Approximating Multi-dimensional Range Queries and Nearest Neighbor Classification in Large Datasets", ICML 2001.
 [7] E. Tsoukatos, D. Gunopulos, "Efficient Mining of SpatioTemporal Patterns",  Int. Symp. on Spatial and Temporal Databases (SSTD) 2001.
 [8] G. Kollios, D. Gunopulos, N. Koudas and S. Berchtold, "`Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Datasets", accepted in IEEE Transactions on Knowledge and Data Engineering (TKDE).
 [9] C. Domeniconi, J. Peng, D. Gunopulos, "`Locally Adaptive Metric Nearest Neighbor Classification", accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
 [10] B. Bollobas, G. Das, D. Gunopulos, H. Mannila, "Time-Series Similarity Problems and Well-Separated Geometric Sets", Nordic Journal of Computing, 4/2001.
 [11] A. Gionis, D. Gunopulos, N. Koudas: "Efficient and Tunable Similar Set Retrieval",  SIGMOD 2001.
 [12] D. Gunopulos, G. Das,  "Time Series Similarity Measures",  Tutorial in SIGMOD 2001.
 [13] D. Zhang, A. Markowetz, V. Tsotras, D. Gunopulos, B. Seeger:  "Efficient Computation of Temporal Aggregates with Range Predicates,"  in  PODS 2001.
 [14] G. Kollios, D. Gunopulos, N. Koudas, S. Berchtold:  "An efficient Approximation Scheme for Data Mining Tasks",  ICDE 2001.
 [15] V. Kalogeraki,  D. Gunopulos:  "Managing Multimedia Streams in Distributed Environments using CORBA",  MIS 2000.
 [16] C. Domeniconi, J. Peng, D. Gunopulos:  "An adaptive Metric Machine for Pattern Classification",  NIPS 2000.
 [17] D. Gunopulos, G. Kollios, V. Tsotras, C. Domeniconi: "Using kernels to approximate multi-dimensional aggregate range queries over real attributes." Workshop on New Perspectives in Kernel-Based Learning Methods, NIPS 2000.
 [18] P. Chou, E. Grossman, D. Gunopulos, P. Kamesan:  "Identifying Perspective Customers," in  SIGKDD 2000.
 [19] C. Domeniconi, J. Peng and D. Gunopulos: "Adaptive Metric Nearest Neighbor Classification",  in Computer Vision   & Pattern Recognition 2000.
 [20] Dimitrios Gunopulos, George Kollios, Vassilis J. Tsotras, Carlotta Domeniconi: "Approximating Multi-Dimensional Aggregate Range Queries over real Attributes",  2000 ACM SIGMOD.

 [21]M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, E. Keogh, "Indexing Multi-Dimensional Time-Series with Support for Multiple Distance Measures", Proc. of 9th SIGKDD, 2003.

 [22] M. Cardle, M. Vlachos, S. Brooks, E. Keogh, D. Gunopulos, "Fast Motion Capture Matching with Replicated Motion Editing", Proc. of SIGGRAPH 2003, San Diego, Technical Sketches & Applications

 [23] M. Vlachos, J. Lin, E. Keogh, D. Gunopulos, "A Wavelet-Based Anytime Algorithm for K-Means Clustering of Time Series", Proc. of 3rd SDM 2003, Workshop on Clustering High Dimensionality Data and Its Applications.

 [24] D. Papadopoulos, C. Domeniconi, D. Gunopulos, Sheng Ma, "Clustering Gene Expression Data in SQL Using Locally Adaptive Metrics", 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.

 [25] C. Domeniconi, D. Gunopulos, "An Efficient Density-Based Approach for Data Mining Tasks", Knowledge and Information Systems, 2003, accepted.

 [26] Huiwen Wu, Dimitrios Gunopulos, "Evaluating the Utility of Statistical Phrases and Latent Semantic Indexing for Text Classification", IEEE Int. Conf. on Data Mining, 2002.

 [27] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, R. Sharma, "Discovering all most specific sequences", ACM TODS, accepted.

 [28] D. Gunopulos, G. Kollios, V. Tsotras, C. Domeniconi, "Selectivity Estimators for Multi-Dimensional Range Queries over Real Attributes", VLDB Journal, Accepted.

 [29] J. Lin, D. Gunopulos, "Dimensionality Reduction by Random Projection and Latent Semantic Indexing", Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, (2003).

Ph.D, and Master Thesis, Books:
[1] Ilias Tsoukatos:  "Mining Spatiotemporal Data", M.S. Thesis, Univ. Of California, Riverside, February 2001.
[2] Michail Vlachos: "Discovering Similar Trajectories", M.S. Thesis, Univ. Of California, Riverside, 2001.

[3] Carlotta Domeniconi: "Locally Adaptive Techniques for Pattern Classifcation", Ph.D. Thesis, UCR, 8/2002.

[4] Michalis Vazirgiannis, Maria Halkidi, Dimitrios Gunopulos, "Quality Assessment and Uncertainty Handling in Data Mining" Springer-Verlag LNAI Series, 2003.

 

Project Impact:
Carlotta Domeniconi graduated with her Ph.D. degree in Summer 2002. She is now an Assistant Professor at George Masosn University. Her research involves the design of techniques for density approximation of multidimensional datasets, and the theoretical and experimental evaluation of such techniques. Michail Vlachos got a MS degree in 2001 and is currently working on his Ph.D. research. For his MS thesis he desinged and implemented a technique for efficiently discovering similar object trajectories. Ilias Tsoukatos graduated with a MS degree during 2001. For his thesis he designed and implemented a system for mining spatiotemporal datasets, such as ebvironmental data, and finding frequent patterns in such datasets. Dimitrios Papadopoulos is expected to earn his MS. degree in Summer 2003.
Parts of this research are used in a Data Mining graduate course (CS235) taught at the University of California, Riverside (Fall 2000). This is the first time this particular course is taught in the Department.
We have also organized, in collaboration with Dr. Koudas from AT&T Labs, a 5 day tutorial session that was presented under the Center for Discrete Mathematics and Theoretical Computer Sciene (DIMACS) 2001-2004 Special Focus on Data Analysis and Mining, and sponsored by NSF. The tutorial included 16 invited speakers from industry and academia.
For this project we have established a cooperation with  AT&T Labs. Through this cooperation we expect to have access to real datasets and applications. We also cooperate with the Department of  Defense.

 

Outreach Activities:
The project findings are summarized in a dedicated web page:http://www.cs.ucr.edu/~dg/discovery.html
The major findings are published in journals and conferences. We gave two tutorials in SIGKDD 2000 and SIGMOD 2000 on time sequence similarity measures, and a tutorial in SSTD 2001 on mining spatiotemporal data.

Goals, Objectives, and Targeted Activities:
This proposal researches techniques that support efficient data mining operations on geospatial data. In addition it proposes new data mining tasks that can be applied to such data. The problems we are addressing are the following: 

(i) Density estimators to approximate geospatial data sets: We investigate two new techniques for approximating the density of geospatial data sets. These are multi-dimensional histograms with overlapping buckets of variable size and multi-dimensional kernel estimators with adaptive bandwidths [20, 17, 6, 28]. Advanced density estimator techniques allow efficient approximation of range queries on geospatial data, and enable interactive exploratory data mining. 

(ii) Adaptive sampling for data mining algorithms: We will investigate the use of adaptive sampling for batched data mining tasks for geospatial datasets. In adaptive sampling, dense regions of the space are sampled at a different frequency than sparse regions. Adaptive sampling requires an efficient density estimation technique [4, 8]. To address this issue we plan to use the new density estimation techniques we propose above. 

(iii) Similarity models for generalized time series: Spatiotemporal data describe objects that change over time. We propose new similarity models to identify objects that change their location in similar patterns [12, 10, 2], and identify regions that show similar patterns in the change of their attributes [7]. These techniques can be used for clustering objects or for answering nearest neighbor queries. 

(iv) Locally adaptive nearest neighbor classification: In [19, 16, 9, 5, 1, 25, Domeniconi Ph.D. Thesis] we show how we can use local information for improving the performance of nearest neighbor classification techniques on numerical data. 

(v) Set Indexing: In [11] we give a technique that solves approximately the set indexing problem: Given a query set q and a collection of sets S, find the sets in S that are most similar to q. 

(vi) Stream Computing: In [3, 4] we initiate a study on how to efficiently do range query approximation and how to induce SVM classifiers on stream data.

 (vii) Applications of similarity models for trajectories: In [21] we give new indexing techniques for finding similar trajectories. We apply such techniques for matching body movements in [22]. In [24] we give algorithms for clustering gene expression time sequences. 

(viii) Text indexing: In [26] we discuss the effects of using phrases in combination to dimensionality reduction techniques. In [29] we compare the effectiveness of different dimensionality reduction techniques for text retrieval.

 (ix) Association rules mining: In [27] we give a theoretical foundtion on the complexity of the problem of finding all maximal frequent sets.

 

Area  Background:
This project researches basic techniques that support efficient data mining operations in geospatial data, either in a static environment or in a time evolving environment (also termed a spatiotemporal environment). Typically, geospatial data contain geometric or topological information, in addition to other kinds of attributes, such as environmental factors or data related to natural resource exploration. Examples of time evolving geospatial data include data that describe the evolution of natural phenomena, earth science data that describe spatiotemporal phenomena in the atmosphere,  two- or three-dimensional sequences of images of geographical regions, or data that describe the location of individuals in the geographic space as a function of time.  We consider two possible modes of user interaction. In the first, exploratory data mining, the user takes an exploratory part in the discovery process by (i) defining and testing hypotheses, (ii) asking queries on the data and (iii) striving to understand the structure of the data.  Examples of such tasks are answering range queries, or finding the nearest neighbor (or nearest neighbors) to a given object. In the second, batched data mining the user defines the goal that the data mining technique has to achieve. Such data mining tasks include clustering, classification, outlier detection, feature similarity, spatial association rules, trend detection. Speeding up the execution of such tasks, and scaling the algorithms to run for large data sets, is one of the most important problems in data mining research.