National Science Foundation

Project Information:
Project Title: "Query Processing Techniques over Objects with Functional Attributes"
Award Number: IIS-0534781
PI: Vassilis J. Tsotras
co-PI: Dimitrios Gunopulos
Duration:  8/15/2006 - 8/14/2009

Students:
Petko Bakalov
Benjamin Arai
Huseyin Hakkoymaz
Mirella M. Moro

Web Page:
http://www.cs.ucr.edu/~tsotras/functional.html

Project Summary:
The goal of this project is to provide efficient techniques for querying objects with "functional" attributes. Emerging applications (transportation networks, traffic management, meteorological, sensor-based surveillance, mobile services) involve objects that employ one or more functional attributes. These attributes are typically functions of time and/or space (e.g. vehicle speed, precipitation, traffic flow etc.) The importance of functional attributes has been recently recognized by GIS vendors; however their support is still primitive. Functional attributes drastically affect query processing since the query answer is based on the efficient evaluation of the attribute functions. Novel methods and solutions are thus needed. While functions can be very complex, this study concentrates on functions described (and thus stored) in constant space (e.g. polynomials of constant degree). Solutions to various query variations are proposed (aggregations, selections, joins, NNs etc.) while both historical and predictive queries are examined. The approach taken consists of maintaining specialized indices that incrementally compute query results. A prototype that serves as a toolbox of efficient techniques for processing complex objects is also created. The results of this research will improve querying capabilities for the many applications that contain objects with functional attributes (e.g. meteorological, geospatial, cellular, demographic, social, surveillance and traffic management environments, to name a few). The proposed prototype will be available for use by researchers and students. Course materials using this prototype will be developed as well. Through an existing collaboration with a GIS vendor, the results of this project have the potential to impact the implementation of future GIS systems.

 

Research Activities - Year 1:

Below we describe the various research papers that have resulted from this work during the first year of this project. In particular, in [1] we describe how to compute aggregates with range predicates; these are aggregates that include key-range predicates as well (range-temporal aggregates). In this paper we address a novel and practical variation called functional range-temporal aggregation. Here, the value of any record can be a general function over time. The meaning of aggregation is altered such that the contribution of a record to the aggregation result is proportional to the size of the intersection between the record's time interval and the query time interval. We propose various indexing methods that can answer the functional temporal aggregation problem. Both analytical and experimental results show the efficiency of our result.

With the increasing popularity of GPS, RFID and sensor technologies and location-based services, trajectories have become central in many applications. Trajectories are functional objects, since they represent the movement of an object (which is typically stored as a complex function). We have thus made various advances on querying trajectories. In [2] we introduced the distributed spatiotemporal similarity search problem: given a query trajectory Q, we want to find the trajectories that follow a motion similar to Q, when each of the target trajectories is segmented across a number of distributed nodes. This is typically the case when trajectories are gathered in a distributed way (say by a sensor network where each sensor is responsible monitoring a given area). We proposed two novel algorithms, UB-K and UBLB-K, which combine local computations of lower and upper bounds on the matching between the distributed subsequences and Q. Such an operation generates the desired result without pulling together all the distributed subsequences over the fundamentally expensive communication medium.

In [3] we examine trajectory joins over streaming spatiotemporal data. Given a stream of spatiotemporal trajectories created by monitored moving objects, the outcome of a Continuous Spatiotemporal Trajectory Join (CSTJ) query is the set of objects in the stream, which have shown similar behavior over a query-specified time interval, relative to the current timestamp. We propose a novel indexing scheme for streaming spatiotemporal data and develop algorithms for CSTJ evaluation, which utilize the proposed indexing scheme and effectively reduce the computation cost and I/O operations.

In [4] we introduce a novel type of tree-like indexing structure which organizes the trajectory data in an archive, based on their similarity. The structure is also able to provide upper and lower bound estimations over the trajectories. Effectively, each trajectory is approximated by a "wedge" (as has been proposed for time-series data) which is then indexed. Given that the actual functions describing trajectories can be quite complex, this envelope enables for easier indexing. Using the estimations provided by this index we propose a generic framework for efficiently answering a wide range of similarity based trajectory queries like Similarity Threshold query and Similarity Best Fit (SBF) query. The effectiveness of our algorithms in reducing the query computation cost and I/O operations is revealed through a thorough experimental evaluation.

In the past, there have been two main ways to query trajectories, either by "similarity" queries (find all trajectories similar -according to some metric-to a given trajectory) or, by range/NN queries (find all trajectories that pass through this range or are near this point at a given time). We have recently proposed a new way to query trajectories, by "motion pattern" queries; here the user describes a series of predicates (ranges, NN, similarities etc., assigned to a series of temporal constraints) that the trajectories in the answer should follow. In [5] we extended our proposal to an on-line environment. In particular, we introduce a novel query type defined over streaming moving object data, namely, the Continuous Motion Pattern (CMP) Queries. A motion pattern is defined as a sequence of distinct spatial predicates, each attached to a temporal constraint. The spatial predicates can be of various types (range, nearest neighbor, etc.) The temporal constraints are relative to the current time instant and are used to specify the order of the spatial predicates on the time axis. A CMP query is continuously reevaluated over streaming spatiotemporal data, producing the moving objects which satisfy the query's motion pattern. We first introduce a novel indexing scheme for spatiotemporal streams that facilitates the evaluation of the spatial predicates over their temporal constraints. Using this scheme we propose a generic framework for efficiently answering a wide range of CMP queries. Finally, as a first step towards optimizing CMP queries, we present a novel cost model for analyzing CMP query performance.

Based on our collaboration with ESRI, we also examined a problem that relates to maintaining connectivity in a transportation network. Network data models are frequently used as a mechanism to describe the connectivity between spatial features in many emerging GIS applications. The importance of such models has increased recently due to their relevance to many commercial systems (location-based services, transportation design, navigational systems etc.) Such connectivity information is required for solving a wide range of location based queries like finding the shortest path, service areas discovery, allocation and distance matrix computation. Real life networks are dynamic in nature since spatial features can be periodically modified. Such updates may change the connectivity relations with the other features and connectivity needs to be reestablished. Existing approaches are not suitable for a dynamic environment, since whenever a feature change occurs, the whole network connectivity has to be reconstructed from scratch. In [6] we propose an efficient algorithm which incrementally maintains connectivity within a dynamic network. This solution is of great practical value - the amortized cost of maintaining an incrementally rebuildable network is far less than an ordinary network that must be periodically rebuilt in its entirety.

During this period we also worked on a sensor related problem. The ability to provide reliable in-network storage while balancing the energy consumption of individual sensors is a primary concern when deploying a sensor network. The main concern with data-centric storage in sensor networks is the ability to provide reliable and load balanced storage. Energy and wireless range constraints make centralized approaches for storage impractical, and in-network data-centric solutions can be used to reduce the number of messages sent over the network. However, these solutions quickly become expensive when combined with fault-tolerance, load balancing and routing. In [7], we present a novel data-centric storage and query routing mechanism for sensor networks. The routing mechanism is constructed upon the neighborhood information of individual sensors and is completely independent of geographical information. Our data resilient algorithm is capable of recovering from multiple simultaneous failures in the network while adaptively adjusting the load distribution of the newly generated sensor data.

Moreover, we authored two papers for the Encyclopedia of Geographic Information Science that is published by Springer. In particular, [8] examines ways to maintain top-K queries over distributed sensor and trajectory data, while [9] surveys various techniques that have been proposed on indexing mobile objects.

 

Research Activities - Year 2:

During the second year we concentrated our efforts to a specific kind of objects with functional attributes, namely trajectories. Trajectories are functional objects, since they represent the movement of an object (which is typically stored as a complex function). The reason is that such data abounds in GIS applications with the emergence of new location-based services and the wide acceptance of GPS and cellular technologies.

In particular we started the implementation of a prototype system that manages and queries trajectories. In the past, there have been two main ways to query trajectories, either by "similarity/clustering" queries (find all trajectories similar -according to some metric-to a given trajectory) or, by range/NN queries (find all trajectories that pass through this range or are near this point at a given time). We feel that these two queries may either provide too few data (few trajectories typically match the full extend of a given trajectory), or, too much data (many trajectories pass by a given area at a given time). Thus we have advocated a new way to query trajectories, by "motion pattern" queries. In our prototype, the user describes a series of predicates (ranges, NN, similarities etc., assigned to a series of temporal constraints) that the trajectories in the answer should follow. With this framework, the user has many more capabilities to query trajectories; conceptually, we cover the whole spectrum of queries between the two extremes (similarity and range queries). We have begun to implement our framework and have already included the algorithms presented in [5] (on-line approach) as well as typical pattern queries for trajectories collected in an archive.

Moreover, we have greatly extended our trajectory pattern-querying framework by adding "variables". More specifically, we assume that the spatial domain is divided in a set of non-overlapping regions (for simplicity we can assume this to be a regular grid). Then trajectories can be approximated by the sequence of cells they visited (as well as the time intervals they stayed in these cells). What is important is that now we can represent pattern queries as regular expressions over the fixed spatial alphabet (the universe of cells in the grid) and can be implicitly or explicitly anchored to the time domain. An example such query is: /A//B/C, which finds trajectories that first started in area A, then some time later passed by area B and immediately after they visited area C. Adding variables allows for even more flexibility. Consider for example a query: @x/A//B//@x. Here '@x' represents a variable cell. This query finds trajectories that started from some cell (unknown, we denote it by variable @x) and then visited cells A and B (in that order) while finally they returned to the same cell they started from. In [10] we presented new algorithms to solve the patterns with variables. Our methods have been shown to be up to two orders of magnitude faster than the previous approaches. Furthermore, our queries can also include temporal predicates (i.e., over the history of a moving object) and thus apply to task 1 as well. We are planning to add these methods to our prototype during the third year of the project.

We have also examined "density" based queries in this environment, with one example being the so-called "flock" queries. Here we are interested in finding groups of moving objects whose trajectories are within a predefined distance for a certain period of time. Examples of such queries appear in surveillance systems, monitoring convoys of vehicles, flock of animals, etc. However, previous algorithms to find these kind of patterns are not scalable, cannot be applied in streaming data, and/or typically return a superset of the answer needed. In [11] we propose efficient algorithms to find moving patterns in trajectory archives as well as in streaming data.

Furthermore we continued our work on approximation techniques for indexing and approximate querying of spatial and temporal data. In [12] and [13] we focus on improving sampling techniques for geospatial data over sensor networks. Many recent proposed techniques offer error-bounding solutions for aggregate approximation but they cannot guarantee energy spending. Inversely, our goal is to bound the energy consumption while minimizing the approximation error. In [12] we propose an online algorithm, region sampling, for computing approximate aggregates while satisfying a pre-defined energy budget. Our algorithm is distinguished by segmenting a sensor network into partitions of non-overlapping regions and performing sampling and local aggregation for each region. Comprehensive experiments on real-world data sets indicate that our approach is at a minimum of 10% more accurate compared with the previously proposed solutions.

In [13] we propose a two-level sampling approach focusing on peer-to-peer databases for maximizing sample quality given a user-defined communication budget. Given that individual peers may have varying cardinality we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By performing local optimization of individual peer sample rates we maximize approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various levels of known information and unknown information about the network and its peers.

In [14] and [15] we focus on time series and provide techniques for approximate similarity search and summarization respectively. In particular, in [14] we introduced an Embedding-Based Subsequence Matching (EBSM) method for approximate subsequence matching. EBSM significantly improves the efficiency of subsequence matching in large time series data sets under the dynamic time warping (DTW) distance measure.

Typically work on time-series representation has concentrated on approaches that are calculated in batch mode and represent each value with approximately equal fidelity. However, the increasing deployment of mobile devices and real-time sensors requires representations that can be incrementally updated and can approximate the data with fidelity proportional to its age. The latter property allows us to answer queries about the recent past with greater precision, since in many domains, recent information is more useful than older information. We call such representations "amnesic". In [15] we introduce a novel representation of time series that can represent arbitrary user-specified amnesic functions. For example, a meteorologist may decide that data that is twice as old can tolerate twice as much error and thus specify a linear amnesic function. In contrast, an econometrist might opt for an exponential amnesic function. Amnesic functions correspond to yet another dimension of functional attributes that this project deals with. We propose online algorithms for our representation and discuss their properties. Finally, we perform an extensive empirical evaluation on 40 data sets and show that our approach can efficiently maintain a high-quality amnesic approximation.

We continued our collaboration with ESRI. In [16] we describe a flexible versioning scheme for network models (which describe the connectivity between spatial features in GIS applications). It turns out that the standard database mechanisms for concurrency control, which include transactions and locking protocols, do not provide the support needed for updating geographic data. The preferred method to resolve conflicts in GIS systems is the use of multiple versions of the data to encapsulate the modifications generated by the end users. Our solution is based on marking "dirty" areas and "dirty" objects (regions and objects which contain conflicts between multiple users) and subsequent cleaning them.

In [17] and [18] we describe distributed techniques for managing spatial and temporal data and streams. During this period we co-authored four new papers for the Encyclopedia of Geographic Information Science. In particular, [19] examines ways to discover similar trajectories, [20] summarizes work on indexing spatiotemporal archives, [21] presents a survey of extensible index structures, while [22] is an introduction to R-trees.

Finally, during this period we gave two keynote speeches for SSTD'07 and MobiDE'08 [23, 24].
 

Project References:

[1] Donghui Zhang, Alexander Markowetz, Vassilis J. Tsotras, Dimitrios Gunopulos, Bernhard Seeger, "On Computing Temporal Aggregates with Range Predicates", ACM TODS, 33(2): (2008).

[2] Demetrios Zeinalipour-Yazti, Song Lin, Dimitrios Gunopulos, "Distributed Spatio-Temporal Similarity Search", the 15th ACM Conference on Information and Knowledge Management (CIKM), Arlington, VA, November 2006.

[3] Petko Bakalov, Vassilis J. Tsotras, "Continuous Spatiotemporal Trajectory Joins", 2nd International Conference on Geosensor Networks (GSN'06), Boston, MA, October 1-3, 2006; extended version to appear in Springer Verlag volume.

[4] Petko Bakalov, Eamonn Keogh, Vassilis J. Tsotras, "TS2-tree - an efficient similarity based organization for trajectory data". ACM GIS 2007, p. 58 (poster paper; best poster runner-up).

[5] Petko Bakalov, Vassilis J. Tsotras, "A Generic Framework for Continuous Motion Pattern Query Evaluation". ICDE 2008, pp: 80-89.

[6] Petko Bakalov, Erik Hoel, Wee-Liang Heng, Vassilis J. Tsotras, "Maintaining Connectivity and Versioning in Dynamic Multimodal Networks". ICDE 2008, pp: 1267-1276.

[7] Song Lin, Benjamin Arai, Dimitrios Gunopulos, "Reliable Hierarchical Data Storage in Sensor Networks", 19th International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, to appear, July 2007.

[8] Song Lin, Demetrios Zeinalipour-Yazti, Dimitrios Gunopulos, "Top-k Retrieval Techniques in Distributed Sensor Systems", Encyclopedia of Geographical Information Science, Springer, 2008.

[9] George Kollios, Dimitrios Gunopulos, Vassilis J. Tsotras, "Mobile Object Indexing", Encyclopedia of Geographical Information Science, Springer, 2008.

[10] Marcos Vieira, Petko Bakalov, Vassilis J. Tsotras, "Querying Trajectories Using Flexible Patterns". Submitted for publication.

[11] Marcos Vieira, Petko Bakalov, Vassilis J. Tsotras, "Incremental Algorithms for Evaluation of Flock Queries over Spatiotemporal Trajectories". Submitted for publication.

[12] Song Lin, Benjamin Arai, Dimitrios Gunopulos, Gautam Das, "Region Sampling: Continuous Adaptive Sampling on Sensor Networks". ICDE 2008: 794-803.

[13] Benjamin Arai, Song Lin, Dimitrios Gunopulos, "Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks". ICDM 2007: 23-32.

[14] Vassilis Athitsos, Panagiotis Papapetrou, Michalis Potamias, George Kollios, Dimitrios Gunopulos, "Approximate embedding-based subsequence matching of time series". SIGMOD Conference 2008: 365-378.

[15] Themis Palpanas, Michail Vlachos, Eamonn J. Keogh, Dimitrios Gunopulos, "Streaming Time Series Summarization Using User-Defined Amnesic Functions". IEEE Trans. Knowl. Data Eng. 20(7): 992-1006 (2008).

[16] Petko Bakalov, Erik G. Hoel, Wee-Liang Heng, Vassilis J. Tsotras, "Versioning network models in a multiuser environment". Submitted for publication.

[17] Vana Kalogeraki, Dimitrios Gunopulos, Ravi S. Sandhu, Bhavani M. Thuraisingham, "QoS Aware Dependable Distributed Stream Processing". ISORC 2008: 69-75.

[18] Themis Palpanas, Vana Kalogeraki, Dimitrios Gunopulos, "Online Distribution Estimation for Streaming Data: Framework and Applications". SEBD 2007: 430-438.

[19] George Kollios, Michail Vlachos, Dimitrios Gunopulos, "Discovering Similar Trajectories". Encyclopedia of GIS 2008, pp: 1168-1173.

[20] Marios Hadjieleftheriou, George Kollios, Vassilis J. Tsotras, Dimitrios Gunopulos, "Indexing Spatio-temporal Archives". Encyclopedia of Geographical Information Science, Springer, 2008, pp: 530-538.

[21] Marios Hadjieleftheriou, Erik G. Hoel, Vassilis J. Tsotras, "Extensible Index Structures". Encyclopedia of GIS Geographical Information Science, Springer, 2008, pp: 483-493.

[22] Marios Hadjieleftheriou, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras, "R-Trees - A Dynamic Index Structure for Spatial Searching". Encyclopedia of Geographical Information Science, Springer, 2008, pp: 993-1002.

[23] Vassilis J. Tsotras, "Recent Advances on Querying and Managing Trajectories". Keynote Speech, 10th International Symposium, SSTD 2007, Boston, MA, July 16-18, 2007.

[24] Vassilis J. Tsotras, "Complex Pattern Queries over Spatiotemporal Trajectories". Keynote Speech, 7th ACM International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE'08), Vancouver, Canada, June 13, 2008.

 

 Presentations:

Vassilis J. Tsotras: "Recent Advances on Querying and Managing Trajectories" SSTD 2007 Keynote Speech; slides available at: http://www.cs.ust.hk/~sstd07/slides/SSTD-07-keynote-Tsotras.ppt).

Point of Contact: Vassilis J. Tsotras

 

Date of Last Update: July 20th, 2008