Project
Information:
Project Title: "Query Processing Techniques over Objects
with Functional Attributes"
Award Number: IIS-0534781
PI: Vassilis J. Tsotras
co-PI: Dimitrios Gunopulos
Duration: 8/15/2006 -
8/14/2009
Students:
Petko Bakalov
Benjamin Arai
Huseyin Hakkoymaz
Mirella M. Moro
Web Page:
http://www.cs.ucr.edu/~tsotras/functional.html
Project Summary:
The goal of this project is to provide efficient techniques
for querying objects with "functional" attributes. Emerging applications
(transportation networks, traffic management, meteorological, sensor-based
surveillance, mobile services) involve objects that employ one or more
functional attributes. These attributes are typically functions of time and/or
space (e.g. vehicle speed, precipitation, traffic flow etc.) The importance of
functional attributes has been recently recognized by GIS vendors; however their
support is still primitive. Functional attributes drastically affect query
processing since the query answer is based on the efficient evaluation of the
attribute functions. Novel methods and solutions are thus needed. While
functions can be very complex, this study concentrates on functions described
(and thus stored) in constant space (e.g. polynomials of constant degree).
Solutions to various query variations are proposed (aggregations, selections,
joins, NNs etc.) while both historical and predictive queries are examined. The
approach taken consists of maintaining specialized indices that incrementally
compute query results. A prototype that serves as a toolbox of efficient
techniques for processing complex objects is also created. The results of this
research will improve querying capabilities for the many applications that
contain objects with functional attributes (e.g. meteorological, geospatial,
cellular, demographic, social, surveillance and traffic management environments,
to name a few). The proposed prototype will be available for use by researchers
and students. Course materials using this prototype will be developed as well.
Through an existing collaboration with a GIS vendor, the results of this project
have the potential to impact the implementation of future GIS systems.
Below we describe the various research papers that
have resulted from this work during the first year of this project. In
particular, in [1] we describe how to compute aggregates with range
predicates; these are aggregates that include key-range predicates as well
(range-temporal
aggregates). In this paper we
address a novel and practical variation called functional
range-temporal aggregation.
Here, the value of any record can be a general function over time. The meaning
of aggregation is altered such that the contribution of a record to the
aggregation result is proportional to the size of the intersection between the
record's time interval and the query time interval. We propose various indexing
methods that can answer the functional temporal aggregation problem. Both
analytical and experimental results show the efficiency of our
result.
With the increasing popularity of GPS, RFID
and sensor technologies and location-based services, trajectories have become
central in many applications. Trajectories are functional objects, since they
represent the movement of an object (which is typically stored as a complex
function). We have thus made various advances on querying trajectories. In [2]
we introduced the distributed spatiotemporal similarity search problem: given a
query trajectory Q, we want to find the trajectories that follow a motion
similar to Q, when each of the target trajectories is segmented across a number
of distributed nodes. This is typically the case when trajectories are gathered
in a distributed way (say by a sensor network where each sensor is responsible
monitoring a given area). We proposed two novel algorithms, UB-K and UBLB-K,
which combine local computations of lower and upper bounds on the matching
between the distributed subsequences and Q. Such an operation generates the
desired result without pulling together all the distributed subsequences over
the fundamentally expensive communication medium.
In [3] we examine trajectory joins over
streaming spatiotemporal data. Given a stream of spatiotemporal trajectories
created by monitored moving objects, the outcome of a Continuous Spatiotemporal
Trajectory Join (CSTJ) query is the set of objects in the stream, which have
shown similar behavior over a query-specified time interval, relative to the
current timestamp. We propose a novel indexing scheme for streaming
spatiotemporal data and develop algorithms for CSTJ evaluation, which utilize
the proposed indexing scheme and effectively reduce the computation cost and I/O
operations.
In [4] we introduce a novel type of tree-like indexing structure which organizes the trajectory data in an archive, based on their similarity. The structure is also able to provide upper and lower bound estimations over the trajectories. Effectively, each trajectory is approximated by a "wedge" (as has been proposed for time-series data) which is then indexed. Given that the actual functions describing trajectories can be quite complex, this envelope enables for easier indexing. Using the estimations provided by this index we propose a generic framework for efficiently answering a wide range of similarity based trajectory queries like Similarity Threshold query and Similarity Best Fit (SBF) query. The effectiveness of our algorithms in reducing the query computation cost and I/O operations is revealed through a thorough experimental evaluation.
In the past, there have been two
main ways to query trajectories, either by "similarity" queries (find all
trajectories similar -according to some metric-to a given trajectory) or, by
range/NN queries (find all trajectories that pass through this range or are near
this point at a given time). We have recently proposed a new way to query
trajectories, by "motion pattern" queries; here the user describes a series of
predicates (ranges, NN, similarities etc., assigned to a series of temporal
constraints) that the trajectories in the answer should follow. In [5] we
extended our proposal to an on-line environment. In particular, we introduce a
novel query type defined over streaming moving object data, namely, the
Continuous Motion Pattern (CMP) Queries. A motion pattern is defined as a
sequence of distinct spatial predicates, each attached to a temporal constraint.
The spatial predicates can be of various types (range, nearest neighbor, etc.)
The temporal constraints are relative to the current time instant and are used
to specify the order of the spatial predicates on the time axis. A CMP query is
continuously reevaluated over streaming spatiotemporal data, producing the
moving objects which satisfy the query's motion pattern. We first introduce a
novel indexing scheme for spatiotemporal streams that facilitates the evaluation
of the spatial predicates over their temporal constraints. Using this scheme we
propose a generic framework for efficiently answering a wide range of CMP
queries. Finally, as a first step towards optimizing CMP queries, we present a
novel cost model for analyzing CMP query performance.
Based on our collaboration with ESRI, we also examined a problem that relates to maintaining connectivity in a transportation network. Network data models are frequently used as a mechanism to describe the connectivity between spatial features in many emerging GIS applications. The importance of such models has increased recently due to their relevance to many commercial systems (location-based services, transportation design, navigational systems etc.) Such connectivity information is required for solving a wide range of location based queries like finding the shortest path, service areas discovery, allocation and distance matrix computation. Real life networks are dynamic in nature since spatial features can be periodically modified. Such updates may change the connectivity relations with the other features and connectivity needs to be reestablished. Existing approaches are not suitable for a dynamic environment, since whenever a feature change occurs, the whole network connectivity has to be reconstructed from scratch. In [6] we propose an efficient algorithm which incrementally maintains connectivity within a dynamic network. This solution is of great practical value - the amortized cost of maintaining an incrementally rebuildable network is far less than an ordinary network that must be periodically rebuilt in its entirety.
During this period we also worked on a sensor related problem. The ability to provide reliable in-network storage while balancing the energy consumption of individual sensors is a primary concern when deploying a sensor network. The main concern with data-centric storage in sensor networks is the ability to provide reliable and load balanced storage. Energy and wireless range constraints make centralized approaches for storage impractical, and in-network data-centric solutions can be used to reduce the number of messages sent over the network. However, these solutions quickly become expensive when combined with fault-tolerance, load balancing and routing. In [7], we present a novel data-centric storage and query routing mechanism for sensor networks. The routing mechanism is constructed upon the neighborhood information of individual sensors and is completely independent of geographical information. Our data resilient algorithm is capable of recovering from multiple simultaneous failures in the network while adaptively adjusting the load distribution of the newly generated sensor data.
Moreover,
we authored two papers for the Encyclopedia of Geographic Information Science
that is published by Springer. In particular, [8] examines ways to maintain
top-K queries over distributed sensor and trajectory data, while [9] surveys
various techniques that have been proposed on indexing mobile
objects.
Research Activities - Year 2:
During
the second year we concentrated our efforts to a specific kind of objects with
functional attributes, namely trajectories. Trajectories are functional objects,
since they represent the movement of an object (which is typically stored as a
complex function). The reason is that such data abounds in GIS applications with
the emergence of new location-based services and the wide acceptance of GPS and
cellular technologies.
In particular we started the implementation of a prototype system that manages and queries trajectories. In the past, there have been two main ways to query trajectories, either by "similarity/clustering" queries (find all trajectories similar -according to some metric-to a given trajectory) or, by range/NN queries (find all trajectories that pass through this range or are near this point at a given time). We feel that these two queries may either provide too few data (few trajectories typically match the full extend of a given trajectory), or, too much data (many trajectories pass by a given area at a given time). Thus we have advocated a new way to query trajectories, by "motion pattern" queries. In our prototype, the user describes a series of predicates (ranges, NN, similarities etc., assigned to a series of temporal constraints) that the trajectories in the answer should follow. With this framework, the user has many more capabilities to query trajectories; conceptually, we cover the whole spectrum of queries between the two extremes (similarity and range queries). We have begun to implement our framework and have already included the algorithms presented in [5] (on-line approach) as well as typical pattern queries for trajectories collected in an archive.
Moreover, we have greatly extended
our trajectory pattern-querying framework by adding "variables". More
specifically, we assume that the spatial domain is divided in a set of
non-overlapping regions (for simplicity we can assume this to be a regular
grid). Then trajectories can be approximated by the sequence of cells they
visited (as well as the time intervals they stayed in these cells). What is
important is that now we can represent pattern queries as regular expressions
over the fixed spatial alphabet (the universe of cells in the grid) and can be
implicitly or explicitly anchored to the time domain. An example such query is:
/A//B/C, which finds trajectories that first started in area A, then some time
later passed by area B and immediately after they visited area C. Adding
variables allows for even more flexibility. Consider for example a query:
@x/A//B//@x. Here '@x' represents a variable cell. This query finds trajectories
that started from some cell (unknown, we denote it by variable @x) and then
visited cells A and B (in that order) while finally they returned to the same
cell they started from. In [10] we presented new algorithms to solve the
patterns with variables. Our methods have been shown to be up to two orders of
magnitude faster than the previous approaches. Furthermore, our queries can also
include temporal predicates (i.e., over the history of a moving object) and thus
apply to task 1 as well. We are planning to add these methods to our prototype
during the third year of the project.
We have also examined "density" based queries in this environment, with one example being the so-called "flock" queries. Here we are interested in finding groups of moving objects whose trajectories are within a predefined distance for a certain period of time. Examples of such queries appear in surveillance systems, monitoring convoys of vehicles, flock of animals, etc. However, previous algorithms to find these kind of patterns are not scalable, cannot be applied in streaming data, and/or typically return a superset of the answer needed. In [11] we propose efficient algorithms to find moving patterns in trajectory archives as well as in streaming data.
Furthermore we continued our work on approximation techniques for indexing and approximate querying of spatial and temporal data. In [12] and [13] we focus on improving sampling techniques for geospatial data over sensor networks. Many recent proposed techniques offer error-bounding solutions for aggregate approximation but they cannot guarantee energy spending. Inversely, our goal is to bound the energy consumption while minimizing the approximation error. In [12] we propose an online algorithm, region sampling, for computing approximate aggregates while satisfying a pre-defined energy budget. Our algorithm is distinguished by segmenting a sensor network into partitions of non-overlapping regions and performing sampling and local aggregation for each region. Comprehensive experiments on real-world data sets indicate that our approach is at a minimum of 10% more accurate compared with the previously proposed solutions.
In [13] we propose a two-level sampling approach focusing on peer-to-peer databases for maximizing sample quality given a user-defined communication budget. Given that individual peers may have varying cardinality we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By performing local optimization of individual peer sample rates we maximize approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various levels of known information and unknown information about the network and its peers.
In [14] and [15] we focus on time series and provide techniques for approximate similarity search and summarization respectively. In particular, in [14] we introduced an Embedding-Based Subsequence Matching (EBSM) method for approximate subsequence matching. EBSM significantly improves the efficiency of subsequence matching in large time series data sets under the dynamic time warping (DTW) distance measure.
Typically work on time-series representation has concentrated on approaches that are calculated in batch mode and represent each value with approximately equal fidelity. However, the increasing deployment of mobile devices and real-time sensors requires representations that can be incrementally updated and can approximate the data with fidelity proportional to its age. The latter property allows us to answer queries about the recent past with greater precision, since in many domains, recent information is more useful than older information. We call such representations "amnesic". In [15] we introduce a novel representation of time series that can represent arbitrary user-specified amnesic functions. For example, a meteorologist may decide that data that is twice as old can tolerate twice as much error and thus specify a linear amnesic function. In contrast, an econometrist might opt for an exponential amnesic function. Amnesic functions correspond to yet another dimension of functional attributes that this project deals with. We propose online algorithms for our representation and discuss their properties. Finally, we perform an extensive empirical evaluation on 40 data sets and show that our approach can efficiently maintain a high-quality amnesic approximation.
We continued our collaboration with
ESRI. In [16] we describe a flexible versioning scheme for network models (which
describe the connectivity between spatial features in GIS applications). It
turns out that the standard database mechanisms for concurrency control, which
include transactions and locking protocols, do not provide the support needed
for updating geographic data. The preferred method to resolve conflicts in GIS
systems is the use of multiple versions of the data to encapsulate the
modifications generated by the end users. Our solution is based on marking
"dirty" areas and "dirty" objects (regions and objects which contain conflicts
between multiple users) and subsequent cleaning them.
In
[17] and [18] we describe distributed techniques for managing spatial and
temporal data and streams. During
this period we co-authored four new papers for the Encyclopedia of Geographic
Information Science. In particular, [19] examines ways to discover similar
trajectories, [20] summarizes work on indexing spatiotemporal archives, [21]
presents a survey of extensible index structures, while [22] is an introduction
to R-trees.
Finally,
during this period we gave two keynote speeches for SSTD'07 and MobiDE'08 [23,
24].
[1] Donghui Zhang, Alexander Markowetz, Vassilis J. Tsotras, Dimitrios Gunopulos, Bernhard Seeger, "On Computing Temporal Aggregates with Range Predicates", ACM TODS, 33(2): (2008).
[2] Demetrios Zeinalipour-Yazti, Song Lin, Dimitrios Gunopulos, "Distributed Spatio-Temporal Similarity Search", the 15th ACM Conference on Information and Knowledge Management (CIKM), Arlington, VA, November 2006.
[3] Petko Bakalov, Vassilis J. Tsotras, "Continuous Spatiotemporal Trajectory Joins", 2nd International Conference on Geosensor Networks (GSN'06), Boston, MA, October 1-3, 2006; extended version to appear in Springer Verlag volume.
[4] Petko Bakalov, Eamonn Keogh, Vassilis J. Tsotras, "TS2-tree - an efficient similarity based organization for trajectory data". ACM GIS 2007, p. 58 (poster paper; best poster runner-up).
[5] Petko Bakalov, Vassilis J. Tsotras, "A Generic Framework for Continuous Motion Pattern Query Evaluation". ICDE 2008, pp: 80-89.
[6] Petko Bakalov, Erik Hoel, Wee-Liang Heng, Vassilis J. Tsotras, "Maintaining Connectivity and Versioning in Dynamic Multimodal Networks". ICDE 2008, pp: 1267-1276.
[7] Song Lin, Benjamin
Arai, Dimitrios Gunopulos, "Reliable Hierarchical Data Storage in Sensor
Networks", 19th International Conference on Scientific and Statistical Database
Management (SSDBM), Banff, Canada, to appear, July 2007.
[8] Song Lin, Demetrios Zeinalipour-Yazti, Dimitrios Gunopulos, "Top-k Retrieval Techniques in Distributed Sensor Systems", Encyclopedia of Geographical Information Science, Springer, 2008.
[9] George Kollios, Dimitrios Gunopulos, Vassilis J. Tsotras, "Mobile Object Indexing", Encyclopedia of Geographical Information Science, Springer, 2008.
[10] Marcos Vieira, Petko Bakalov, Vassilis J. Tsotras, "Querying
Trajectories Using Flexible Patterns". Submitted for
publication.
[11] Marcos Vieira, Petko Bakalov, Vassilis J. Tsotras, "Incremental Algorithms for Evaluation of Flock Queries over Spatiotemporal Trajectories". Submitted for publication.
[13] Benjamin Arai, Song Lin, Dimitrios Gunopulos, "Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks". ICDM 2007: 23-32.
[14] Vassilis Athitsos, Panagiotis Papapetrou, Michalis Potamias, George
Kollios, Dimitrios Gunopulos, "Approximate embedding-based subsequence matching
of time series". SIGMOD Conference 2008: 365-378.
[15] Themis Palpanas, Michail Vlachos, Eamonn J. Keogh, Dimitrios
Gunopulos, "Streaming Time Series Summarization Using User-Defined Amnesic
Functions". IEEE Trans. Knowl. Data
[16] Petko Bakalov, Erik G. Hoel, Wee-Liang Heng, Vassilis J. Tsotras, "Versioning network models in a multiuser environment". Submitted for publication.
[18] Themis Palpanas, Vana Kalogeraki, Dimitrios Gunopulos, "Online Distribution Estimation for Streaming Data: Framework and Applications". SEBD 2007: 430-438.
[19] George Kollios, Michail Vlachos, Dimitrios Gunopulos, "Discovering Similar Trajectories". Encyclopedia of GIS 2008, pp: 1168-1173.
[20] Marios Hadjieleftheriou, George Kollios, Vassilis J. Tsotras, Dimitrios Gunopulos, "Indexing Spatio-temporal Archives". Encyclopedia of Geographical Information Science, Springer, 2008, pp: 530-538.
[21] Marios Hadjieleftheriou, Erik G. Hoel, Vassilis J. Tsotras, "Extensible Index Structures". Encyclopedia of GIS Geographical Information Science, Springer, 2008, pp: 483-493.
[22] Marios Hadjieleftheriou, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras, "R-Trees - A Dynamic Index Structure for Spatial Searching". Encyclopedia of Geographical Information Science, Springer, 2008, pp: 993-1002.
[23] Vassilis J. Tsotras,
"Recent Advances on Querying and Managing Trajectories". Keynote Speech, 10th
International Symposium, SSTD 2007,
[24] Vassilis J. Tsotras, "Complex Pattern
Queries over Spatiotemporal Trajectories". Keynote Speech, 7th ACM International
Workshop on Data Engineering for Wireless and Mobile Access (MobiDE'08),
Vassilis J. Tsotras: "Recent Advances on Querying and Managing Trajectories" SSTD 2007 Keynote Speech; slides available at: http://www.cs.ust.hk/~sstd07/slides/SSTD-07-keynote-Tsotras.ppt).
Point of Contact: Vassilis J. Tsotras
Date of Last Update: July 20th, 2008