**"Discovering Hidden Semantics from Spatio-temporal Sensed Data"**

Funded by the National Science Foundation

PI: Vassilis J. Tsotras
Co-PI: Chinya Ravishankar
Award Number: IIS-1527984

Duration: 09/01/2015 through 08/31/2018

Students:
Pritom Ahmed
Payas Rajan
Elena Strzheletska

Project Summary:

Spatiotemporal data are rich in semantics and embed a great deal of high-level information on the behavior of moving objects as well their interactions with objects in the environment and with each other. To adequately reveal such semantics, this project has set three research goals in the spatiotemporal domain.

The first goal is to study queries that elicit the semantics of interactions between moving objects and their environment; an example of such a novel query is the `dwell region' query. A region R is a dwell region for a moving object O if, given a threshold distance d and duration t, every point of R remains within distance d from O for at least time t. Clearly, points within R are likely to be of interest to O, so identification of O and R has applications in areas such as monitoring and surveillance (i.e., identify the region that is under surveillance by a moving object).

The second goal is to address queries eliciting the semantics of interactions, between moving objects themselves. One example is the `assembly' query which identifies potential meeting points between moving objects when their trajectories are not known fully or precisely. Another example is addressing complex `reachability' queries, which elicit possible interactions between objects via intermediaries.

A third goal of this project is to identify efficient solutions that are also practical. In that sense, our solutions should scale. An important related question is then how to efficiently index trajectory data. Another question is whether parallelism can be used to expedite the above queries.

Research Activities: Year 1

During the first year of this project we worked on the following:

(1) Dwell Region Queries. A Dwell Region query enables users identify new regions of interest, based on a moving object's spatio-temporal behavior. In [1] we first proposed an online algorithm to solve this problem, which can handle dynamic additions and deletions of moving object locations in logarithmic time. In this environment we assume an incoming stream of object positions, and maintain the upper and lower bounds for the radius of the smallest circle enclosing these positions, as points are added and deleted. These bounds allow us to greatly reduce the number of trajectory points we need to consider in the query, as well as to defer query evaluation. Our method can approximate the radius of the smallest circle enclosing a given subtrajectory within an arbitrarily small user-defined factor. As a byproduct, we solve efficiently a decision query of whether a dwell region exists or not. This query has applications to the trajectory simplification problem. We also examine the offline version of the problem and propose a preprocessing method that enables fast dwell region queries on archived trajectories. Using the preprocessed information dwell regions can be identified by only looking into a very small part of the archived data. Our experiments using real trajectory datasets show that the online approach can scale up to hundreds of thousands of moving objects for streaming data. For archived trajectories, the preprocessing approach gives orders of magnitudes of speedup in query evaluation time.

(2) Assembly Queries. Consider objects moving in a road network (e.g., groups of friends), who may be free to choose routes, yet be required to arrive at certain locations at certain times (preexisting requirements). Such objects may need to assemble in groups within the network (friends meet while all visiting the same city) without violating their preexisting constraints. Planning for such assemblies is hard when the network or the number of objects is large. Conversely, discovering actual or potential assemblies of such objects is important in many surveillance, security, and law-enforcement applications. This can be hard when object arrival observations are sparse due to inadequate sensor coverage or object countermeasures. In [2] we proposed the novel class of Assembly Queries to model these scenarios, and presented a unified scheme that addresses both of these complementary challenges. Given a set of objects and arrival constraints, we show how to first obtain the set of all possible locations visited by each moving object (the travel corridor), and then determine all possible assemblies, including the participants, locations, and durations. We present a formal model for various tracking strategies and several algorithms for using these strategies. We achieve excellent performance on these queries by preprocessing the network, using Contraction Hierarchies. Experimental results on real-world road networks show that we can efficiently and rapidly infer assembly information for very large networks and object groups. When compared with the traditional Dijkstra-based approaches our method runs 1-2 orders of magnitude faster.

(3) Spatiotemporal Reachability Queries. These queries arise naturally when determining how diseases, information, malware, physical items, etc. can propagate through a collection of moving objects; hence such queries are significant for many important domains like epidemiology, public health, security monitoring, surveillance, and social networks. While traditional reachability queries have been studied in graphs extensively, what makes spatiotemporal reachability queries different and challenging is that the associated graph is dynamic and space-time dependent. Moreover, previous spatiotemporal reachability work assumes an `instant transfer' scenario (where a node is considered to be instantly reachable from another node as long as there is a directed path to it, independently of the length and the number of intermediate nodes on that path), which may not be the case in many real world applications. Further, in [3] we assume that the collection of objects is large, and their behavior is considered over an extensive period of time; hence the dataset becomes very large itself, and does not fit in main memory. As a result an efficient solution needs to be disk I/O efficient. To solve this problem we proposed the RICC (Reachability Index Construction by Contraction) approach. We tested our algorithm on two types of realistic datasets using queries of various temporal length and different types (with single and multiple sources and targets). The results of our experiments show that RICC can be efficiently used for answering a wide range of spatiotemporal reachability queries on disk-resident datasets. When compared with the previous best known approach, RICC offers drastic reduction on the I/O (up to 80%).

(4) Investigative Search over Social Posts. The amount of user generated data increases every year, as more social interaction tools like Twitter, Instagram and Facebook are being created and more users use them to share their everyday experiences. Most research on analyzing social data has focused on detecting trends and patterns such as bursty topics and popular spatiotemporal paths, event extraction, studying how information is spread, or analyzing properties of the social graph. In contrast, and given the spatiotemporal nature of many of these social posts, we are interested in how to search social network data items, posts and images, based on spatiotemporal keyword queries. That is, in [4] we created methods to perform investigative search, in contrast to the trending queries studied by previous work. Investigative search can also be viewed as exploring the currently untapped long tail of the distribution of topics in social networks. This work led to the creation of OSNI (for Online Social Network Investigator) a scalable distributed system to search social network data, based on a spatiotemporal window and a list of keywords.

Research Activities: Year 2

During the second year of this project we worked on the following:

(1) Assembly Queries. We continued and revised our work and formally defined three assembly queries. We first defined the more general [γ, τ]-assembly query, which returns the set of assembly nodes at which at least γ objects could have met for at least τ time units. Another important query is the [top-k(γ), τ] query, that finds the k assembly nodes with the largest possible meeting sizes. Finally we defined the [γ, top-k(τ )] query, that reports the k assembly nodes with the longest-possible meeting durations. In the original work we had discussed the concepts of availability intervals, corridors, and assemblies with respect to nodes only, i.e., not along edges. However, it is likely that objects would actually meet at a location somewhere along the midspan of a road (edge), e.g., at a restaurant, mall, gas station, etc. We presented a way to address this case as well. The paper [2] was presented at the ACM SIGSPATIAL 2017 Conference.

(2) Top-k Frequent Terms over Spatial Ranges. The wide availability of tracking devices has drastically increased the role of geolocation in social networks, resulting in new commercial applications; using geolocation, marketers can identify current trending topics within some region of interest and focus their products accordingly. In [5] we studied a novel analytics query on geotagged data, where given a spatial area we find the most frequent terms among the social posts in that area (kFST query). While there has been prior work on keyword search on spatial data (finding the objects located nearest to the query region and containing the given keywords), and on group keyword search on spatial data (retrieving groups of objects), our problem is different in that it finds the most frequent terms without any given set of query keywords. We proposed a novel index structure and algorithms to efficiently answer such spatial aggregation queries. Our index structure employs an R-tree augmented by top-k sorted term lists (STLs), where a key challenge is to balance the size of the index to achieve faster execution and smaller space requirements. We theoretically studied and experimentally validated the ideal length of the stored term lists, and performed detailed experiments to evaluate the performance of the proposed methods compared to baselines on real datasets. We further extended our algorithms to also cover the multi-region kFST problem and showed that our multi-region approach is more efficient than simply running the single-region algorithm multiple times. The full paper was presented in ACM SIGMOD 2017.

(3) Reachability with Meetings. Previous works on spatiotemporal reachability queries assume that information or physical items can be passed from one object to another either instantaneously or after some processing delay. There are applications however, where we also need to account for the transfer of information; doing so implies that the interacting objects would be forced to stay in contact for some time. This requirement makes the query processing even more challenging. In this work, we introduced a novel problem, namely, spatiotemporal reachability queries with transfer delays (`meetings'). For answering reachability queries with meetings, we proposed two algorithms, namely, RICCmeetMin and RICCmeetMax. To prune the search space during query time, these algorithms precompute some reachability events: the shortest valid (RICCmeetMin), and the longest possible (RICCmeetMax) meetings respectively. An extended experimental evaluation examines the efficiency and pruning characteristics of both RICCmeet algorithms for answering a variety of spatiotemporal reachability queries with meetings on large disk-resident datasets. The presented algorithms were about 3 times faster than the previous approach. Our work [6] appeared at the ACM SIGSPATIAL 2017 Conference.

(4) Indexing Trajectories with Space Filling Curves. A classic problem for large trajectory repositories is how to efficiently index and query these trajectories. We consider a trajectory as a `polyline' whose endpoints correspond to a time-ordered sequence of locations (points) visited by a moving object (typically provided by sensor readings, user check-ins etc.) Common queries on trajectories are variations of spatial ranges and nearest neighbors. Traditionally, spatial objects are first abstracted by minimum bounding rectangles (MBRs) which are then inserted in R-trees (or their variations). However, using MBRs to index trajectories introduces large dead space and thus a high number of false positives for queries. Space filling curves (SFC) have been used extensively for indexing points in two and higher dimensional spaces. Their advantage arises from their locality preservation and dimensionality reduction. Extending SFCs to index segments instead of points is not trivial as it may lead to false negatives and/or higher query times. In our work we examine how to use SFCs to index trajectory polylines. Our initial experimental results show that our proposed method gives 2 to 15 times improvement for range queries over the state of the art methods. Our work [7] appeared at the ACM SIGSPATIAL 2018 Conference

(5) Massively Parallel Skyline Queries. Processing-In-Memory (PIM) is an increasingly popular architecture aimed at addressing the `memory wall' crisis by prioritizing the integration of processors within DRAM. The skyline operator is a known primitive used to identify those multi-dimensional points offering optimal trade-offs within a given dataset. For large scale data, calculating the skyline set is extensively compute and data intensive. PIM architectures offer many opportunities to mitigate this cost. However, their execution model relies on all processors operating in isolation with minimal data exchange. This prohibits direct application of known skyline optimizations which are inherently sequential, creating dependencies that limit the maximum parallelism and throughput. It is thus challenging to introduce a well design parallel skyline algorithm for prominent PIM architectures. In this work, we address these challenges by introducing a new parallel skyline algorithm called DSky. It is designed to be massively parallel and throughput efficient by leveraging a novel work assignment strategy that emphasizes on load balancing. Our experiments demonstrate that DSky outperforms in most cases the state-of-the-art algorithms developed for the CPU and GPU. For roughly the same work, our solution [8] achieves an order of magnitude higher throughput and 75% of the peak system throughput.

Research Activities: Year 3

During the third year of this project we worked on the following:

(1) Reachability with Decay: We considered reachability queries under the `information decay' assumption. A spatiotemporal reachability query identifies whether an item could have been transferred from the source object to the target object during a time interval (directly, or through intermediate transfers). In previous research, the weight of a piece of transferred information remains the same. Our work introduces a novel reachability query, which assumes a scenario of information decay (when the value of information that travels through the chain of intermediate objects decreases with each transfer). This in turn leads to an interesting extension: if there are many different sources of information, the aggregate value of information an object can collect varies. As a result, under the decay scenario, we introduce a top-k reachability problem, identifying the k objects with the highest accumulated information. This work led to a submitted paper [9].

(2) Electric Vehicle Routing: We worked on the problem of routing electric vehicles in the presence of uncertainties arising from traffic and other unpredictable phenomena. Electric Vehicle (EV) battery capacity is limited, so EV routing must trade off travel time for energy consumption, which grows quadratically with speed. Current multi-parameter EV routing methods assume accurate estimates of time and energy consumed, but current models for obtaining these estimates cannot capture this time-energy tradeoff in a sufficiently flexible way. In [10] we present a new approach to EV modeling that addresses such shortcomings using a new structuring abstraction for vehicle speed profiles called 'phases', which models energy consumption accurately at lower temporal granularity. Using 52 hours of driving data collected on a Nissan Leaf, we showed that our model achieves a per-trip accuracy better than even that of current microscopic models and generates speed profiles that accurately model time and energy consumption at the trip level.

(3) Time Decaying Bloom Filters: Current Bloom Filters tend to ignore Bayesian priors as well as a great deal of useful information they hold, compromising the accuracy of their responses. Incorrect responses cause users to incur penalties that are both application- and item-specific, but current Bloom Filters are typically tuned only for static penalties. Such shortcomings are problematic for all Bloom Filter variants, but especially so for Time-decaying Bloom Filters, in which the memory of older items decays over time, causing both false positives and false negatives. We address these issues by introducing inferential filters, which integrate Bayesian priors and information latent in filters to make penalty-optimal, query-specific decisions. We also show how to properly infer insertion times in such filters. While our methods are general, we illustrate their application to inferential time-decaying filters to support novel query types and sliding window queries with dynamic error penalties. In [11] we present inferential versions of the Timing Bloom Filter and Generalized Bloom Filter. Our experiments on real and synthetic datasets show that our methods reduce penalties for incorrect responses to sliding-window queries in these filters by up to 70% when penalties are dynamic.

(4) The α-happiness query: When faced with a large database containing millions of products, a user may be interested in a (typically much) smaller representative subset. Various approaches have been researched over the years to create a good representative subset that fits the user’s needs which are expressed through a utility function (e.g., top-k queries and diversification queries). The recently proposed k-regret query does not require a utility function from the users and has a bounded output size as it returns a set of at most k tuples that minimizes a criterion, called the maximum regret ratio; in a sense this query finds the best subset of k tuples that minimizes the user regret (measured by how far this subset is from the ideal subset). Instead, we proposed the α-happiness query, which can be thought as a dual problem to the k-regret query. That is, we find the smallest subset of tuples while guaranteeing that the minimum happiness ratio of the users is at least α (i.e., the least tuples needed to keep the users happy at a given level). In order to answer an α-happiness query, we formulate it as a constrained optimization problem, and explain it from a geometric perspective. We proposed several algorithms for the α-happiness query with theoretical guarantees under different scenarios. Extensive experiments on both real and synthetic datasets show that our proposed algorithms perform effectively and efficiently for answering the α-happiness query by returning the least number of tuples as compared with the best known algorithms in literature [12].

Research Activities: Year 4

During the fourth year of this project we worked on the following:

(1) Tweet Geolocation in Any Language: The tweet geolocation problem (identify where a particular tweet was sent from) is very challenging; the majority of Twitter users do not geolocate their tweets. Most of previous approached worked on tweets that use English. In [13] we introduced the Unicode Convolutional Neural Network (UnicodeCNN) for analyzing text written in any language. The UnicodeCNN does not require the language to be known in advance, allows the language to change arbitrarily mid-sentence, and is robust to the misspellings and grammatical mistakes commonly found in social media. We demonstrated the UnicodeCNN’s effectiveness on the challenging task of content-based tweet geolocation using a dataset with 900 million tweets written in more than 100 languages.

(2) Image Geolocation: Existing methods for geolocating images use standard classification or image retrieval techniques. These methods have poor theoretical properties because they do not take advantage of the earth’s spherical geometry. In some cases, they require training data sets that grow exponentially with the number of feature dimensions. Our work [14, 15] introduces the Mixture of von-Mises Fisher (MvMF) loss function, which is the first loss function that exploits the earth’s spherical geometry to improve geolocation accuracy. We proved that this loss requires only a dataset of size linear in the number of feature dimensions, and empirical results showed that our method outperforms previous methods with orders of magnitude less training data and computation.

(3) In-memory Top-K selection: We developed algorithmic solutions for parallel in-memory Top-k selection. In [16] we proposed three distinct processing models that offer varying levels of parallelism. We introduced the concept of rank uncertainty used to discern (given a small representative subset of existing approaches) those having the highest potential to perform well for main memory processing. We proposed three algorithms, namely VTA, PTA, and SLA. All these methods utilize a simple and easy to maintain data structure, within a conventional DBMS, called a TBL list. PTA adopts a new strategy to minimize rank uncertainty which relies on angle space partitioning. In its scalar form, PTA exhibits several orders of magnitude better performance compared to previous works. In addition, PTA outperforms parallel variants of previous methods that utilize reordering (VTA) and layering (SLA).

(4) GPU- accelerated Top-K selection: We developed parallel threshold algorithms optimized to enable GPU accelerated Top-k selection with support for early termination. We considered two different data partitioning strategies, evaluating their effectiveness on various data distributions and query parameters. Our empirical results showcased that data preordering when combined with angle space partitioning is superior in terms of tuple evaluations compared against random partitioning [17].

Publications:

[1] M Reaz Uddin, Chinya V. Ravishankar, Vassilis J. Tsotras, "Dwell Regions: a Novel Query for Identifying Regions of Interest in Trajectory Data", submitted.

[2] M Reaz Uddin, Michael N. Rice, Chinya V. Ravishankar, Vassilis J. Tsotras, "Assembly Queries: Planning and Discovering Assemblies of Moving Objects Using Partial Information", ACM SIGSPATIAL GIS Conference, Redondo Beach, November 2017, 24:1-24:10, link.

[3] Elena V. Strzheletska, Vassilis J. Tsotras, ``RICC: Fast Reachability Query Processing on Large Spatiotemporal Datasets", SSTD 2015, pp 3-21, link.

[4] Shiwen Cheng, James Fang, Vagelis Hristidis, Harsha V. Madhyastha, Niluthpol Chowdhury Mithun, Dorian Perkins, Amit K. Roy-Chowdhury, Moloud Shahbazi, Vassilis J. Tsotras, ``OSNI: Searching for Needles in a Haystack of Social Network Data", EDBT 2016, pp: 616-619, link.

[5] Mahbub Hasan, Pritom Ahmed, Abhijith Kashyap, Vagelis Hristidis, Vassilis J. Tsotras, ``Efficient Computation of Top-k Frequent Terms over Spatial Ranges", Proceedings ACM SIGMOD Conference, Chicago, May 2017, pages 1227-1241, link.

[6] Elena V. Strzheletska, Vassilis J. Tsotras, ``RICCmeet: Efficient Processing of Novel Reachability Queries on Large Spatiotemporal Datasets", ACM SIGSPATIAL GIS Conference, Redondo Beach, November 2017, 22:1-22:10, link.

[7] Reaz Uddin, Chinya V. Ravishankar, Vassilis J. Tsotras: Indexing moving object trajectories with hilbert curves. SIGSPATIAL/GIS 2018: 416-419, link.

[8] Vasileios Zois, Divya Gupta, Vassilis J. Tsotras, Walid A. Najjar, Jean Francois Roy, ``Massively Parallel Skyline Computation For Processing-In-Memory Architectures", 27th International Conference on Parallel Architectures and Compilation Techniques, (PACT), Limassol, Cyprus, November 2018, 1:1-1:12, link.

[9] Elena V. Strzheletska and Vassilis J. Tsotras. Answering Reachability Queries with Transfer Decay, submitted.

[10] Payas Rajan and Chinya V. Ravishankar. The Phase Abstraction for Estimating Energy Consumption and Travel Times for Electric Vehicle Route Planning. Proceedings of ACM SIGSPATIAL Conf., 2019 (to appear).

[11] Jonathan L. Dautrich Jr., Chinya V. Ravishankar: Inferring Insertion Times and Optimizing Error Penalties in Time-decaying Bloom Filters. ACM Trans. Database Systems, 44(2): 7:1-7:32 (2019), link.

[12] Min Xie, Raymond Chi-Wing Wong, Peng Peng, Vassilis Tsotras. Being Happy with the Least: Achieving α-happiness with Minimum Number of Tuples. IEEE ICDE Conference 2020 (to appear).

[13] Mike Izbicki, Vassilis Tsotras, Evangelos Papalexakis. Geolocating Tweets in any Language at any Location. Proceedings of 28th ACM Intl. Conf. on Information and Knowledge Management (CIKM), 2019 (to appear).

[14] Mike Izbicki, Evangelos E. Papalexakis and Vassilis J. Tsotras. Exploiting the Earth’s Spherical Geometry to Geolocate Images. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2019 (to appear).

[15] Mike Izbicki, Evangelos E. Papalexakis, and Vassilis J. Tsotras. The MvMF Loss for Predicting Locations on the Earth’s Surface. Proceedings of MACLEAN: MAChine Learning for EArth ObservatioN (workshop @ECML/PKDD2019), link.

[16] Vasileios Zois, Vassilis J. Tsotras, Walid A. Najjar. Efficient Main-Memory Top-K Selection For Multicore Architectures. Proceedings of the VLDB Endowment (PVLDB), Vol. 13 (2), 2019, link.

[17] Vasileios Zois, Vassilis J. Tsotras, Walid A. Najjar. GPU Accelerated Top-K Selection With Efficient Early Stopping. Tenth International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS), 2019, link.

Acknowledgement: This material is based upon work supported by the National Science Foundation under Grant No. IIS-1527984

Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.