National Science Foundation

Project Information:
Project Title: "
Query Processing Techniques over Objects with Functional Attributes"
Award Number: IIS-0534781
PI: Vassilis J. Tsotras
co-PI: Dimitrios Gunopulos
Duration:  8/15/2006 - 8/14/2009

Petko Bakalov, Benjamin Arai, Huseyin Hakkoymaz, Mirella M. Moro, Theodoros Lappas, Marcos Vieira

Web Page:

Project Summary:

The goal of this project is to provide efficient techniques for querying objects with "functional" attributes. Emerging applications (transportation networks, traffic management, meteorological, sensor-based surveillance, mobile services) involve objects that employ one or more functional attributes. These attributes are typically functions of time and/or space (e.g. vehicle speed, precipitation, traffic flow etc.) The importance of functional attributes has been recently recognized by GIS vendors; however their support is still primitive. Functional attributes drastically affect query processing since the query answer is based on the efficient evaluation of the attribute functions. Novel methods and solutions are thus needed. While functions can be very complex, this study concentrates on functions described (and thus stored) in constant space (e.g. polynomials of constant degree). Solutions to various query variations are proposed (aggregations, selections, joins, NNs etc.) while both historical and predictive queries are examined. The approach taken consists of maintaining specialized indices that incrementally compute query results. A prototype that serves as a toolbox of efficient techniques for processing complex objects is also created. The results of this research will improve querying capabilities for the many applications that contain objects with functional attributes (e.g. meteorological, geospatial, cellular, demographic, social, surveillance and traffic management environments, to name a few). The proposed prototype will be available for use by researchers and students. Course materials using this prototype will be developed as well. Through an existing collaboration with a GIS vendor, the results of this project have the potential to impact the implementation of future GIS systems.


Research Activities - Year 1:

Below we describe the various research papers that have resulted from this work during the first year of this project. In particular, in [1] we describe how to compute aggregates with range predicates; these are aggregates that include key-range predicates as well (range-temporal aggregates). In this paper we address a novel and practical variation called functional range-temporal aggregation. Here, the value of any record can be a general function over time. The meaning of aggregation is altered such that the contribution of a record to the aggregation result is proportional to the size of the intersection between the record's time interval and the query time interval. We propose various indexing methods that can answer the functional temporal aggregation problem. Both analytical and experimental results show the efficiency of our result.

With the increasing popularity of GPS, RFID and sensor technologies and location-based services, trajectories have become central in many applications. Trajectories are functional objects, since they represent the movement of an object (which is typically stored as a complex function). We have thus made various advances on querying trajectories. In [2] we introduced the distributed spatiotemporal similarity search problem: given a query trajectory Q, we want to find the trajectories that follow a motion similar to Q, when each of the target trajectories is segmented across a number of distributed nodes. This is typically the case when trajectories are gathered in a distributed way (say by a sensor network where each sensor is responsible monitoring a given area). We proposed two novel algorithms, UB-K and UBLB-K, which combine local computations of lower and upper bounds on the matching between the distributed subsequences and Q. Such an operation generates the desired result without pulling together all the distributed subsequences over the fundamentally expensive communication medium.

In [3] we examine trajectory joins over streaming spatiotemporal data. Given a stream of spatiotemporal trajectories created by monitored moving objects, the outcome of a Continuous Spatiotemporal Trajectory Join (CSTJ) query is the set of objects in the stream, which have shown similar behavior over a query-specified time interval, relative to the current timestamp. We propose a novel indexing scheme for streaming spatiotemporal data and develop algorithms for CSTJ evaluation, which utilize the proposed indexing scheme and effectively reduce the computation cost and I/O operations.

In [4] we introduce a novel type of tree-like indexing structure which organizes the trajectory data in an archive, based on their similarity. The structure is also able to provide upper and lower bound estimations over the trajectories. Effectively, each trajectory is approximated by a "wedge" (as has been proposed for time-series data) which is then indexed. Given that the actual functions describing trajectories can be quite complex, this envelope enables for easier indexing. Using the estimations provided by this index we propose a generic framework for efficiently answering a wide range of similarity based trajectory queries like Similarity Threshold query and Similarity Best Fit (SBF) query. The effectiveness of our algorithms in reducing the query computation cost and I/O operations is revealed through a thorough experimental evaluation.

In the past, there have been two main ways to query trajectories, either by "similarity" queries (find all trajectories similar -according to some metric-to a given trajectory) or, by range/NN queries (find all trajectories that pass through this range or are near this point at a given time). We have recently proposed a new way to query trajectories, by "motion pattern" queries; here the user describes a series of predicates (ranges, NN, similarities etc., assigned to a series of temporal constraints) that the trajectories in the answer should follow. In [5] we extended our proposal to an on-line environment. In particular, we introduce a novel query type defined over streaming moving object data, namely, the Continuous Motion Pattern (CMP) Queries. A motion pattern is defined as a sequence of distinct spatial predicates, each attached to a temporal constraint. The spatial predicates can be of various types (range, nearest neighbor, etc.) The temporal constraints are relative to the current time instant and are used to specify the order of the spatial predicates on the time axis. A CMP query is continuously reevaluated over streaming spatiotemporal data, producing the moving objects which satisfy the query's motion pattern. We first introduce a novel indexing scheme for spatiotemporal streams that facilitates the evaluation of the spatial predicates over their temporal constraints. Using this scheme we propose a generic framework for efficiently answering a wide range of CMP queries. Finally, as a first step towards optimizing CMP queries, we present a novel cost model for analyzing CMP query performance.

Based on our collaboration with ESRI, we also examined a problem that relates to maintaining connectivity in a transportation network. Network data models are frequently used as a mechanism to describe the connectivity between spatial features in many emerging GIS applications. The importance of such models has increased recently due to their relevance to many commercial systems (location-based services, transportation design, navigational systems etc.) Such connectivity information is required for solving a wide range of location based queries like finding the shortest path, service areas discovery, allocation and distance matrix computation. Real life networks are dynamic in nature since spatial features can be periodically modified. Such updates may change the connectivity relations with the other features and connectivity needs to be reestablished. Existing approaches are not suitable for a dynamic environment, since whenever a feature change occurs, the whole network connectivity has to be reconstructed from scratch. In [6] we propose an efficient algorithm which incrementally maintains connectivity within a dynamic network. This solution is of great practical value - the amortized cost of maintaining an incrementally rebuildable network is far less than an ordinary network that must be periodically rebuilt in its entirety.

During this period we also worked on a sensor related problem. The ability to provide reliable in-network storage while balancing the energy consumption of individual sensors is a primary concern when deploying a sensor network. The main concern with data-centric storage in sensor networks is the ability to provide reliable and load balanced storage. Energy and wireless range constraints make centralized approaches for storage impractical, and in-network data-centric solutions can be used to reduce the number of messages sent over the network. However, these solutions quickly become expensive when combined with fault-tolerance, load balancing and routing. In [7], we present a novel data-centric storage and query routing mechanism for sensor networks. The routing mechanism is constructed upon the neighborhood information of individual sensors and is completely independent of geographical information. Our data resilient algorithm is capable of recovering from multiple simultaneous failures in the network while adaptively adjusting the load distribution of the newly generated sensor data.

Moreover, we authored two papers for the Encyclopedia of Geographic Information Science that is published by Springer. In particular, [8] examines ways to maintain top-K queries over distributed sensor and trajectory data, while [9] surveys various techniques that have been proposed on indexing mobile objects.

Research Activities - Year 2:

During the second year we concentrated our efforts to a specific kind of objects with functional attributes, namely trajectories. Trajectories are functional objects, since they represent the movement of an object (which is typically stored as a complex function). The reason is that such data abounds in GIS applications with the emergence of new location-based services and the wide acceptance of GPS and cellular technologies.

In particular we started the implementation of a prototype system that manages and queries trajectories. In the past, there have been two main ways to query trajectories, either by "similarity/clustering" queries (find all trajectories similar -according to some metric-to a given trajectory) or, by range/NN queries (find all trajectories that pass through this range or are near this point at a given time). We feel that these two queries may either provide too few data (few trajectories typically match the full extend of a given trajectory), or, too much data (many trajectories pass by a given area at a given time). Thus we have advocated a new way to query trajectories, by "motion pattern" queries. In our prototype, the user describes a series of predicates (ranges, NN, similarities etc., assigned to a series of temporal constraints) that the trajectories in the answer should follow. With this framework, the user has many more capabilities to query trajectories; conceptually, we cover the whole spectrum of queries between the two extremes (similarity and range queries). We have begun to implement our framework and have already included the algorithms presented in [5] (on-line approach) as well as typical pattern queries for trajectories collected in an archive.

Moreover, we have greatly extended our trajectory pattern-querying framework by adding "variables". More specifically, we assume that the spatial domain is divided in a set of non-overlapping regions (for simplicity we can assume this to be a regular grid). Then trajectories can be approximated by the sequence of cells they visited (as well as the time intervals they stayed in these cells). What is important is that now we can represent pattern queries as regular expressions over the fixed spatial alphabet (the universe of cells in the grid) and can be implicitly or explicitly anchored to the time domain. An example such query is: /A//B/C, which finds trajectories that first started in area A, then some time later passed by area B and immediately after they visited area C. Adding variables allows for even more flexibility. Consider for example a query: @x/A//B//@x. Here '@x' represents a variable cell. This query finds trajectories that started from some cell (unknown, we denote it by variable @x) and then visited cells A and B (in that order) while finally they returned to the same cell they started from. In [10] we presented new algorithms to solve the patterns with variables. Our methods have been shown to be up to two orders of magnitude faster than the previous approaches. Furthermore, our queries can also include temporal predicates (i.e., over the history of a moving object) and thus apply to task 1 as well. We are planning to add these methods to our prototype during the third year of the project.

We have also examined "density" based queries in this environment, with one example being the so-called "flock" queries. Here we are interested in finding groups of moving objects whose trajectories are within a predefined distance for a certain period of time. Examples of such queries appear in surveillance systems, monitoring convoys of vehicles, flock of animals, etc. However, previous algorithms to find these kind of patterns are not scalable, cannot be applied in streaming data, and/or typically return a superset of the answer needed. In [11] we propose efficient algorithms to find moving patterns in trajectory archives as well as in streaming data.

Furthermore we continued our work on approximation techniques for indexing and approximate querying of spatial and temporal data. In [12] and [13] we focus on improving sampling techniques for geospatial data over sensor networks. Many recent proposed techniques offer error-bounding solutions for aggregate approximation but they cannot guarantee energy spending. Inversely, our goal is to bound the energy consumption while minimizing the approximation error. In [12] we propose an online algorithm, region sampling, for computing approximate aggregates while satisfying a pre-defined energy budget. Our algorithm is distinguished by segmenting a sensor network into partitions of non-overlapping regions and performing sampling and local aggregation for each region. Comprehensive experiments on real-world data sets indicate that our approach is at a minimum of 10% more accurate compared with the previously proposed solutions.

In [13] we propose a two-level sampling approach focusing on peer-to-peer databases for maximizing sample quality given a user-defined communication budget. Given that individual peers may have varying cardinality we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By performing local optimization of individual peer sample rates we maximize approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various levels of known information and unknown information about the network and its peers.

In [14] and [15] we focus on time series and provide techniques for approximate similarity search and summarization respectively. In particular, in [14] we introduced an Embedding-Based Subsequence Matching (EBSM) method for approximate subsequence matching. EBSM significantly improves the efficiency of subsequence matching in large time series data sets under the dynamic time warping (DTW) distance measure.

Typically work on time-series representation has concentrated on approaches that are calculated in batch mode and represent each value with approximately equal fidelity. However, the increasing deployment of mobile devices and real-time sensors requires representations that can be incrementally updated and can approximate the data with fidelity proportional to its age. The latter property allows us to answer queries about the recent past with greater precision, since in many domains, recent information is more useful than older information. We call such representations "amnesic". In [15] we introduce a novel representation of time series that can represent arbitrary user-specified amnesic functions. For example, a meteorologist may decide that data that is twice as old can tolerate twice as much error and thus specify a linear amnesic function. In contrast, an econometrist might opt for an exponential amnesic function. Amnesic functions correspond to yet another dimension of functional attributes that this project deals with. We propose online algorithms for our representation and discuss their properties. Finally, we perform an extensive empirical evaluation on 40 data sets and show that our approach can efficiently maintain a high-quality amnesic approximation.

We continued our collaboration with ESRI. In [16] we describe a flexible versioning scheme for network models (which describe the connectivity between spatial features in GIS applications). It turns out that the standard database mechanisms for concurrency control, which include transactions and locking protocols, do not provide the support needed for updating geographic data. The preferred method to resolve conflicts in GIS systems is the use of multiple versions of the data to encapsulate the modifications generated by the end users. Our solution is based on marking "dirty" areas and "dirty" objects (regions and objects which contain conflicts between multiple users) and subsequent cleaning them.

In [17] and [18] we describe distributed techniques for managing spatial and temporal data and streams. During this period we co-authored four new papers for the Encyclopedia of Geographic Information Science. In particular, [19] examines ways to discover similar trajectories, [20] summarizes work on indexing spatiotemporal archives, [21] presents a survey of extensible index structures, while [22] is an introduction to R-trees.

Research Activities - Year 3:

During the third year of the project we focused on computing functional aggregate queries as well as finishing and testing the prototype of our query framework. We first examined the problem of computing spatial aggregates. Given a collection of spatial objects and a query region, the spatial aggregation query asks for some aggregate value of the objects in the query region. Our solutions apply to the SUM, COUNT and AVERAGE aggregates. To find the total value of objects in the query region, a straightforward approach is to find the objects in the query region and then compute the aggregate value on-the-fly. This approach is inefficient especially when the query region is large and thus contains many objects. Our main contribution of is a novel index structure called BA-tree and an optimized version of it, the BAu-tree, whose efficiency is robust in both non-LRU-buffered and LRU-buffered situations. This solution assumes point objects. We further show how the case when objects have extent can be reduced to the point aggregation problem. When considering objects with extents, there are two variations. In the first variation (simple box-sum), the aggregation query asks for the total value of objects intersecting the query region. In the second variation (functional box-sum), each object participates in the aggregation proportionally to the size of its intersection with the query region. In both cases, the queries are reduced to point aggregations and then solved by the BA-tree. This work appears in [23].

During the third year of the project we continued our work on the spatiotemporal query framework that we started in year 2. In particular, we enhanced our approaches for computing density based queries [11] like flocks, i.e., objects that move within a given distance from each other for a specific time interval. This query belongs to the general category of group-aggregation spatiotemporal queries. An interesting finding for this year is that we proved that computing flocks on-line can be done in polynomial time (previous work by others have suggested exponential algorithms even for the on-line case). Based on these findings we have proposed a generic framework to compute flocks and proposed five algorithms; this work has been accepted to GIS’09 [11].

Furthermore, we considered the problem of finding the k highest-ranked (or Top-k) answers in a distributed sensor network. Our study focuses on multi-hop distributed networks in which the data is accessible by traversing a network of nodes. Such a setting captures very well the computation framework of emerging Sensor Networks, Peer-to-Peer Networks and Vehicular Networks. In [24] we presented the Threshold Join Algorithm (TJA), an efficient algorithm that utilizes a non-uniform threshold on the queried attribute in order to minimize the transfer of data when a query is executed. Our results indicate that TJA requires an order of magnitude less communication than the state-of-the-art, scales well with respect to the parameter k and the network topology.

In [25] we presented a new, distributed, query evaluation algorithm for Motion Pattern Queries. Distributed techniques are particularly important when the data is stored in different physical locations, including possibly stored in mobile devices (laptops, smart-phones, digital cameras) that have been used to collect the data in the field. Our work aims specifically at minimizing the cost of answering snapshot multi-predicate queries in high-communication-cost networks, such as wireless sensor networks or wireless networks of smart-phones. In such networks transmitting data is very demanding in resources, mainly due to energy (battery) constraints. We showed that minimizing the communication cost for multi-predicate queries is NP-hard. We proposed a dynamic programming algorithm to compute the optimal solution for small problem instances. We also proposed a low complexity, approximate, heuristic algorithm for solving larger problem instances efficiently and running it on nodes with low computational power (e.g. sensors). Finally, we presented a variant of the Fermat point problem where distances between points are minimal paths in a weighted graph, and proposed a solution. Using an extensive experimental evaluation that compared the proposed algorithms to the best known technique used to evaluate queries in wireless sensor networks we found an improvement of 10% up to 95%. The low complexity heuristic algorithm is also shown to be scalable and robust to different query characteristics and network size.

In [26] we explored applications and extensions of our work on temporal data analysis to the problem of identifying events in streams of related documents, such as blogs, or historical newspaper archives over a large period of time. The articles in such collections cover newsworthy events that took place at various times. Each event is characterized by a set of descriptive keywords, revealing basic information such as the place where the event occurred, or the names of the persons involved. For the duration of the event’s lifespan and consequent coverage in the news, these characteristic terms appear repeatedly in relevant articles, leading to uncommonly high frequencies (bursts). Term burstiness has been used as a mechanism to address event detection in the context of such collections. We explored how burstiness information can be further utilized to enhance the search process. We have presented a novel approach to model the burstiness of a term, using discrepancy theory concepts. This allows us to build a parameter-free, linear-time approach to identify the time intervals of maximum burstiness for a given term.

We have also continued our work on indexing techniques for finding subsequences in time series, and in [27] we introduced a novel method, called Reference-Based String Alignment (RBSA), that speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure. RBSA operates under the assumption that the optimal match deviates by only a relatively small amount from the query, an amount that does not exceed a pre-specified fraction of the query length. RBSA has an exact version that guarantees no false dismissals and can handle large queries efficiently, outperforming the current state-of-the-art methods. An approximate version of RBSA is also described, that achieves significant additional improvements over the exact version, with negligible losses in retrieval accuracy. RBSA performs filtering of candidate matches using pre-computed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments of length equal to that of the reference sequences. For each of t hose segments, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate subsequence matches. An alphabet collapsing technique is employed to improve the pruning power of the filter step. The experimental evaluation shows that, for query lengths ≥ 200, RBSM outperforms current state-of-the-art sequence alignment methods: BLAST2, BWT-SW and q-grams. Speedups of one to two orders of magnitude over the current state of the art are demonstrated for query sizes ≥ 2, 000.

In [28] we have given a new algorithm for clustering data while incorporating user constraints. The idea of our approach is motivated from physics concepts. The main advantage of the technique that we propose is that is very general: it can be applied to the case of vector data (that is the case that each object is defined as a vector of characteristics) and graph data (when we do not have an explicit representation of the objects, but just a function that defines the similarity or the distance of two objects).

Finally, in [29] we presented a novel way to perform filtering over streaming documents in a publish/subscribe system, using specialized hardware, based on Field Programmable Gate Arrays (FPGAs). Our approach showed drastic throughput improvement (orders of magnitude) over plain software approaches. Since our pattern query language for trajectories has similarities with XPath, we are currently exploring the use of FPGAs for querying streaming trajectories.

During the same period we also co-authored various entries related to indexing and spatial data for the Encyclopedia of Databases (published by Springer).

Research Activities – No-cost Extension Period:

For the no-cost extension period we continued our work on complex queries over spatiotemporal trajectories [10], in particular allowing the user to query trajectories using flexible patterns. Queries are expressed using a regular-expression like language over a set of predefined regions. In addition, queries can contain variables that can be combined with fixed predicates. A demo based on this work is in its final stages of implementation.

Moreover, we collaborated with researchers from Telefonica Research in Madrid, Spain, on using our flexible pattern spatiotemporal queries environment to analyze patterns in mobile call databases (using data provided by Telefonica). Call Detail Record (CDR) databases contain many millions of records with information about mobile phone calls, including the users' location when the call was made/received. This huge amount of spatio-temporal data opens the door for the study of human trajectories on a large scale without the bias that other sources, like GPS or WLAN networks, introduce in the population studied. We presented the Spatio-Temporal Pattern System (STPS) to query spatio-temporal patterns in very large CDR databases. STPS uses a regular-expression query language that is intuitive and that allows for any combination of spatial and temporal predicates with constraints, including the use of variables. The design of the language takes into consideration the layout of the areas being covered by the cellular towers, as well as "areas" that label places of interest (e.g. neighborhoods, parks, etc). A full implementation of the STPS is currently running with real, very large CDR databases at Telefonica Research Labs. An extensive performance evaluation of the STPS shows that it can efficiently find very complex mobility patterns in large CDR databases. This work appears in [30].

We also worked on another novel, location-based spatiotemporal problem, that of continuously monitoring the top-k “unsafe” moving objects. The “unsafety” of an object (protectee, e.g., cash transport van, etc.) is defined by the difference between its safety requirement and the protection provided by protection forces (protectors, e.g., police cars, policemen, etc.) around it. Compared with the traditional top-k queries where the score of an object represents its own characteristics, the above query describes the relationships between protectees and protectors, which introduces computational challenges since naively all objects should be inspected to answer such a query. To avoid this, we offered two efficient algorithms, namely the GridPrune and GridPrunePro, based on the basic pruning technology BoundPrune. Experiments show that the proposed algorithms outperform the naive solution with nearly two orders of magnitude on I/O savings. This work has been submitted for publication [31].

In [32] we examined the problem of identifying influential events from spatiotemporal streams of news data. This is an extension of our previous work in [26]. Consider a set of sources of streaming documents, placed on fixed locations on a 2-dimensional map. These text streams can be thought of as daily news-sites or blogs from different countries that record significant events from their respective locations. Each event (or topic) is characterized by a set of terms that describe its different aspects (e.g. descriptive keywords or names of places and people involved). During the timeframe in which an event is covered in the streams, such terms are used repeatedly in the relevant documents that appear in the mainstream and social media. Intuitively, highly influential events will be reported by numerous sources across the map, for extended timeframes. On the other hand, less significant events can have a more localized impact and shorter lifespan. In our work, we first formalize and address the problem of tracking the spatiotemporal burstiness of individual terms. We then show how we can index this knowledge to efficiently evaluate multi-term queries and retrieve matching documents that discuss influential events with a significant spatiotemporal impact. This work has been submitted for publication.

For our second objective, we worked in collaboration with colleagues from AT&T Research and University of San Paolo in Brazil, on the problem of finding the most representative (or diverse) answers from a query result. In particular, various spatial or spatiotemporal queries (e.g. k-NN, skylines, etc.) may return large answer sets. Typically a user may be interested in the most representative of those answers. In [33] we describe a general framework for evaluation and optimization of methods for diversifying query results. In these methods, an initial ranking answer set produced by a query is used to construct a result set, where elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as possible to the query, and, at the same time, the result set should be as diverse as possible. Using the above framework, we adapted, implemented and evaluated several existing methods for diversifying query results. We also proposed two new approaches, namely the Greedy with Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods. Both methods iteratively construct a result set using a scoring function that ranks candidate elements using not only relevance and diversity to the existing result set, but also accounts for diversity against the remaining candidates. We examined all methods' performance with respect to precision, running time and quality of the result. Our experimental results show that while GNE and GMC have higher running times, they achieve precision very close to the optimal, while also providing the best result quality. This work has been submitted for publication.

In collaboration with an ESRI researcher (Michael Rice) we addressed another problem related to spatiotemporal queries, namely how to address shortest paths over routes with label restrictions. The current widespread use of location-based services and GPS technologies has recently revived interest in very fast and scalable shortest path queries. We introduced a new shortest path query type in which dynamic constraints may be placed on the allowable set of edges that can appear on a valid shortest path (e.g., dynamically restricting the type of roads or modes of travel which may be considered in a multimodal transportation network). We formalize this problem as a specific variant of formal language constrained shortest path problems, which we call the Kleene Language Constrained Shortest Paths problem. To efficiently support this type of dynamically constrained shortest path query for large-scale datasets, we extend the hierarchical graph indexing technique known as Contraction Hierarchies. Our experimental evaluation using the North American road network dataset (with over 50 million edges) shows an average query speed and search space improvement of over three orders of magnitude compared to the naive adaptation of the standard Dijkstra’s algorithm to support this query type. We also show an improvement of over two orders of magnitude compared to the only previously-existing indexing technique which could solve this problem without requiring additional preprocessing. This work is currently under revision [34].

In [35] we continued our work on optimal operator placement for distributed queries over a network [25]. In particular, we presented an optimal distributed algorithm to adapt the placement of a single query operator in high communication cost networks, like wireless sensor networks. Our parameter-free algorithm finds the optimal node to host the operator with minimum communication cost overhead. When no flooding is needed the communication cost overhead for adapting the operator placement is negligible. In addition, our algorithm does not require any extra communication cost while the query is executed. In our experiments we show that for the rest of cases our algorithm saves 30%-85% of the energy compared to previously proposed techniques. To our knowledge this is the first optimal and distributed algorithm to solve the 1-median (Fermat node) problem.

Finally in [36] we proposed a distributed algorithm to construct a balanced communication tree that serves in gathering data from the network nodes to a sink. Our algorithm constructs a near-optimally balanced communication tree with minimum overhead. The balancing of the node degree results in the minimization of packet collisions during query execution that would otherwise require numerous retransmissions and reduce the lifetime of the network. We compare our simple distributed algorithm against previous work and a centralized solution and show that for most network layouts it outperforms the competition and achieves tree balance very close to the centralized algorithm. It also has the smallest energy overhead possible to construct the tree, increasing the lifetime of the network even more.


Vassilis J. Tsotras: "Recent Advances on Querying and Managing Trajectories", SSTD 2007 Keynote Speech; slides available at:

Vassilis J. Tsotras, "Complex Pattern Queries over Spatiotemporal Trajectories", Keynote Speech, 7th ACM International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE'08), Vancouver, Canada, June 13, 2008.

Vassilis J. Tsotras, “Trajectories are Here to Stay”, Keynote Speech, Space-Time Modeling and Analysis Workshop, GIS Week, Redlands, February 2010.

Project References:

[1] Donghui Zhang, Alexander Markowetz, Vassilis J. Tsotras, Dimitrios Gunopulos, Bernhard Seeger, "On Computing Temporal Aggregates with Range Predicates", ACM TODS, 33(2): (2008) [pdf].

[2] Demetrios Zeinalipour-Yazti, Song Lin, Dimitrios Gunopulos, "Distributed Spatio-Temporal Similarity Search", the 15th ACM Conference on Information and Knowledge Management (CIKM), Arlington, VA, November 2006 [link].

[3] Petko Bakalov, Vassilis J. Tsotras, "Continuous Spatiotemporal Trajectory Joins", 2nd International Conference on Geosensor Networks (GSN'06), Boston, MA, October 1-3, 2006 [pdf].

[4] Petko Bakalov, Eamonn Keogh, Vassilis J. Tsotras, "TS2-tree - an efficient similarity based organization for trajectory data". ACM GIS 2007, p. 58 (poster paper; best poster runner-up) [pdf].

[5] Petko Bakalov, Vassilis J. Tsotras, "A Generic Framework for Continuous Motion Pattern Query Evaluation". ICDE 2008, pp: 80-89 [pdf].

[6] Petko Bakalov, Erik Hoel, Wee-Liang Heng, Vassilis J. Tsotras, "Maintaining Connectivity and Versioning in Dynamic Multimodal Networks". ICDE 2008, pp: 1267-1276 [pdf].

[7] Song Lin, Benjamin Arai, Dimitrios Gunopulos, "Reliable Hierarchical Data Storage in Sensor Networks", 19th International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, to appear, July 2007 [link].

[8] Song Lin, Demetrios Zeinalipour-Yazti, Dimitrios Gunopulos, "Top-k Retrieval Techniques in Distributed Sensor Systems", Encyclopedia of Geographical Information Science, Springer, 2008.

[9] George Kollios, Dimitrios Gunopulos, Vassilis J. Tsotras, "Mobile Object Indexing", Encyclopedia of Geographical Information Science, Springer, 2008.

[10] Marcos Vieira, Petko Bakalov, Vassilis J. Tsotras, "Querying Trajectories Using Flexible Patterns". Proceedings of the 13th International Conference on Extending Database Technology (EDBT), pp: 406-417, 2010.

[11] Marcos Vieira, Petko Bakalov, Vassilis J. Tsotras, "On-Line Discovery of Flock Patterns in Spatio-Temporal Data". ACM GIS 2009; Seattle, Washington, to appear [pdf].

[12] Song Lin, Benjamin Arai, Dimitrios Gunopulos, Gautam Das, "Region Sampling: Continuous Adaptive Sampling on Sensor Networks". ICDE 2008: 794-803 [link].

[13] Benjamin Arai, Song Lin, Dimitrios Gunopulos, "Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks". ICDM 2007: 23-32 [link].

[14] Vassilis Athitsos, Panagiotis Papapetrou, Michalis Potamias, George Kollios, Dimitrios Gunopulos, "Approximate embedding-based subsequence matching of time series". SIGMOD Conference 2008: 365-378 [link].

[15] Themis Palpanas, Michail Vlachos, Eamonn J. Keogh, Dimitrios Gunopulos, "Streaming Time Series Summarization Using User-Defined Amnesic Functions". IEEE Trans. Knowl. Data Eng. 20(7): 992-1006 (2008) [link].

[16] Petko Bakalov, Erik G. Hoel, Sudhakar Menon, Vassilis J. Tsotras, "Versioning network models in a multiuser environment". SSTD 2009: 6-24, Aalborg, Denmark, July 8-10, 2009 [pdf].

[17] Vana Kalogeraki, Dimitrios Gunopulos, Ravi S. Sandhu, Bhavani M. Thuraisingham, "QoS Aware Dependable Distributed Stream Processing". ISORC 2008: 69-75 [link].

[18] Themis Palpanas, Vana Kalogeraki, Dimitrios Gunopulos, "Online Distribution Estimation for Streaming Data: Framework and Applications". SEBD 2007: 430-438 [link].

[19] George Kollios, Michail Vlachos, Dimitrios Gunopulos, "Discovering Similar Trajectories". Encyclopedia of Geographical Information Science, Springer, 2008, pp: 1168-1173.

[20] Marios Hadjieleftheriou, George Kollios, Vassilis J. Tsotras, Dimitrios Gunopulos, "Indexing Spatio-temporal Archives". Encyclopedia of Geographical Information Science, Springer, 2008, pp: 530-538.

[21] Marios Hadjieleftheriou, Erik G. Hoel, Vassilis J. Tsotras, "Extensible Index Structures". Encyclopedia of GIS Geographical Information Science, Springer, 2008, pp: 483-493.

[22] Marios Hadjieleftheriou, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras, "R-Trees - A Dynamic Index Structure for Spatial Searching". Encyclopedia of Geographical Information Science, Springer, 2008, pp: 993-1002.

[23] Jarod Wen, Donghui Zhang, Vassilis J. Tsotras, Dimitrios Gunopulos, “On Computing Spatial Aggregates”, under review.

[24] Demetrios Zeinalipour-Yazti, Zografoula Vagena, Vana Kalogeraki, Dimitrios Gunopulos, Vassilis J. Tsotras, Michail Vlachos, Nick Koudas, Divesh Srivastava: “Finding the K highest-ranked answers in a distributed network”, Computer Networks Journal, 53(9): 1431-1449 (2009) [pdf]

[25] Georgios Hatzimilioudis, Huseyin Hakkoymaz, Nikos Mamoulis, Dimitrios Gunopulos, “Operator Placement for Multi-Predicate Snapshot Queries in Wireless Sensor Networks”, 10th International Conference on Mobile Data Management: Systems, Services and Middleware (MDM 2009) [link]

[26] Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitris Kotsakos, Dimitrios Gunopulos, “On Burstiness Aware Search for Document Sequences”, 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD09) [link]

[27] Panagiotis Papapetrou, Vassilis Athitsos, George Kollios, Dimitrios Gunopulos, “Reference-Based Alignment in Large Sequence Databases”, VLDB 2009 Conference, Lyon, France [pdf].

[28] Huseyin Kakkoymaz, Georgios Hatzimilioudis, Dimitrios Gunopulos, Heikki Mannila, “Applying Electromagnetic Field Theory Concepts to Clustering with Constraints”, ECML/PKDD 2009 [link].

[29] Abhishek Mitra, Marcos R. Vieira, Petko Bakalov, Vassilis J. Tsotras, Walid A. Najjar: Boosting XML filtering through a scalable FPGA-based architecture. CIDR 2009 [pdf].

[30] Marcos R. Vieira, Enrique Frías-Martínez, Petko Bakalov, Vanessa Frías-Martínez, Vassilis J. Tsotras, “Querying Spatio-temporal Patterns in Mobile Phone-Call Databases,” Proceedings of the 11th International Conference on Mobile Data Management (MDM), pp.239-248, 2010.

[31] Jian Wen, Vassilis J. Tsotras: “Continuously Monitoring Top-K Unsafe Moving Objects,” submitted for publication.

[32] Theodoros Lappas, Marcos Vieira, Dimitris Gunopulos, Vassilis J. Tsotras: “Searching for Documents on Influential Events through Space and Time,” submitted for publication.

[33] Marcos R. Vieira, Humberto L. Razente, Maria C. N. Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: “On Query Result Diversification,” submitted for publication.

[34] Michael Rice, Vassilis J. Tsotras: “Graph Indexing for Fast and Scalable Shortest Path Queries with Label Restrictions,” submitted for publication.

[35] Georgios Chatzimilioudis, Nikos Mamoulis, Dimitrios Gunopulos: “A Distributed Technique for Dynamic Operator Placement in Wireless Sensor Networks,” Proceedings of Mobile Data Management (MDM), pp: 167-176, 2010. [link].

[36] Georgios Chatzimilioudis, Demetrios Zeinalipour-Yazti, Dimitrios Gunopulos: “Minimum-Hot-Spot Query Trees for Wireless Sensor Networks,” Proceedings 9th International ACM Workshop on Data Engineering for Wireless and Mobile Access (MobiDE), 2010 [pdf].


Point of Contact: Vassilis J. Tsotras

Date of Last Update: August 23rd, 2010