"Collaborative Research: Making Big Data Active: From Petabytes to Megafolks in Milliseconds "

Funded by the National Science Foundation

PI: Vassilis J. Tsotras
Co-PI: Evangelos Christidis
Award Number: IIS-1447826

This is a collaborative project with Award IIS-1447720
PI: Mike Carey, University of California, Irvine
Co-PI: Nalini Venkatasubramanian, University of California, Irvine

Duration:  09/01/2014 through 08/31/2018

Ildar Absalyamov
Waleed Amjad
Steven Jacobs
Mohiuddin Qader
Shouq Sadah
Marcel Tawamba (REU Participant) 


Project Summary: 

A wealth of digital information is being generated daily through social networks, blogs, online communities, news sources, and mobile applications as well as device-based sources in an increasingly sensed world. Organizations and researchers in diverse domains recognize that tremendous value and insight can be gained by capturing this emerging data and making it available for querying and analysis. A number of domains, such as public health, national security, and public safety, could benefit greatly from timely awareness of situations as they first become observable in Big Data. There is a need to move beyond passive Big Data analytical infrastructures – Big Active Data is needed! 

First-generation Big Data management efforts yielded MapReduce-based frameworks such as Hadoop, Pig, and Hive that address after-the-fact data analytics, key-value (“NoSQL”) stores that provide scalable, keyed record storage and retrieval, and a handful of specialized systems that target problems like parallel graph analytics and data stream analysis. This project aims to continuously and reliably capture Big Data arising from social, mobile, Web, and sensed data sources and enable timely delivery of information to users with indicated interests. In a nutshell, we aim to develop techniques to enable the accumulation and monitoring of petabytes of data of potential interest to millions of end users; when “interesting” new data appears, it should be delivered to end users in a timeframe measured in (100’s of) milliseconds. 

Successful execution of this project requires the development of practical solutions to a number of interesting and important technical problems, which we categorize into three Aims.

Aim 1 deals with scalable data input (“data in” problems). To offer declarative subscriptions, we envision the use of data channels to which users can subscribe. On the "data in" side, the technical challenges that we are addressing include resource management in very large scale LSM-based storage systems and the provision of a highly available and elastic facility for fast data ingestion.

Aim 2 addresses data channel and subscription management (“inside” problems). On the "inside", the challenges include the parallel evaluation of a large number of declarative data subscriptions over (multiple) highly partitioned data sets. Amplifying this challenge is a need to efficiently support spatial, temporal, and similarity predicates in data subscriptions. Big Data also makes result ranking and diversification techniques critical in order for large result sets to be manageable.

Aim 3 focuses on scalable data delivery (“data out” problems). On the "data out" side, the technical challenges include the reliable and timely dissemination of data of interest to a sometimes-connected subscriber base of unprecedented scale.

The software basis for this work is AsterixDB, an open-source Big Data Management System (BDMS) that supports the scalable storage, searching, and analysis of mass quantities of semi-structured data. By leveraging AsterixDB, this project aims at the next step in big data management: creating an "active BDMS" and an open source prototype that will be made available to the Big Data research community.

The resulting techniques will be disseminated to the Big Data community through open source software sharing. The techniques are expected to be general-purpose and will inform and benefit other Big Data efforts as well. The project will help to train information management students, at our universities and elsewhere, in technologies related to Big Data management and applications; such training is critical to addressing the information explosion that social media and the mobile Web are driving today. This project can also benefit the broader public by laying a general-purpose foundation for active information dissemination from Big Data, e.g., for use in areas such as public safety and public health.

Research Activities – Year 1

During the first year of this collaborative project, the UCR team worked on the following:

(1) Parameterized Channels. We envision a Big Active Data user model where users form their subscriptions by utilizing a collection of predefined parameterized channels. A channel in our system is defined as a parameterized, query-based function that is registered with the system to be continuously monitored according to its specification. A user then subscribes to a channel by providing parameter values for the channel function; this provides for scalability: results for each user are created by joining the user parameters with the result dataset. In particular, we consider two kinds of channels, ‘repetitive’ and ‘continuous’. Repetitive channels compute their result at predefined temporal intervals (every minute, at a specific time of day, etc.) and always provide the complete query result. Continuous channels are evaluated continuously and provide incrementally new results (as data changes over time). During the first year we completed a working version for the repetitive channel case. In order to build repetitive channels, the following were implemented: (i) a robust DDL that enables channels to be easily created and subscribed to using rich capabilities; (ii) a Channel Job, which is a long-running job living in the system and able to perform its work periodically; and (iii) underlying software to monitor and interact with these long-running jobs.
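
The channel model above can be sketched as a toy, in-memory illustration (the actual system expresses channels declaratively inside AsterixDB; the names `nearby_emergencies`, `Subscription`, and `evaluate_channel` are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Subscription:
    user: str
    params: dict  # the user's bindings for the channel's parameters

def nearby_emergencies(emergencies, radius, location):
    """The channel function: a parameterized, query-based function."""
    x, y = location
    return [e for e in emergencies
            if abs(e["x"] - x) <= radius and abs(e["y"] - y) <= radius]

def evaluate_channel(channel_fn, data, subscriptions):
    """One execution of a repetitive channel: each user's parameters are
    paired with the query result (the real system performs this as a join
    inside the data cluster so one channel serves many subscribers)."""
    return {sub.user: channel_fn(data, **sub.params) for sub in subscriptions}

emergencies = [{"kind": "flood", "x": 1, "y": 1},
               {"kind": "fire",  "x": 9, "y": 9}]
subs = [Subscription("alice", {"radius": 2, "location": (0, 0)}),
        Subscription("bob",   {"radius": 2, "location": (8, 8)})]
out = evaluate_channel(nearby_emergencies, emergencies, subs)
# alice is notified about the flood; bob about the fire
```

A repetitive channel would rerun `evaluate_channel` at its declared interval; a continuous channel would instead emit only the incremental changes.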

(2) Statistics and selectivity estimation in AsterixDB. Modern big data management systems (BDMSs) require revisiting well-known optimization techniques, which tend to fail in the new framework. The quality of the query execution plans produced by an optimizer depends largely on precise knowledge of statistical data distributions. However, most modern database systems rely on offline statistics collection, which is prohibitively expensive in the BDMS context, since it requires rescanning all records in a large, distributed dataset. We lift this limitation by enabling lightweight online statistics collection. In particular, we collect statistics as data is inserted into the system using the LSM-based storage model (currently used by many BDMSs, including AsterixDB). In this case, obtaining data statistics piggybacks on common LSM-component lifecycle events (such as flush and merge) and thus does not require additional data scanning. While previous data management systems traditionally used histograms to store compressed representations of the underlying data, this proved very inefficient in our context, where statistics are incrementally assembled from smaller (LSM-component) pieces. We evaluated alternative data structures and concluded that Haar wavelet synopses enable an efficient merge operation, which is required both to assemble statistics from various LSM components and to combine statistics from different nodes in a partitioned, distributed cluster setup. We have implemented algorithms for the initial creation and merging of Haar wavelets and successfully integrated them into the AsterixDB LSM framework.
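
A toy illustration of why wavelets merge cheaply: the Haar transform is linear, so the synopsis for the union of two LSM components is simply the coefficient-wise sum of their synopses, with no data rescan. (Real synopses also keep only the largest coefficients; that thresholding step is omitted here.)

```python
def haar(freqs):
    """Un-normalized Haar wavelet decomposition of a frequency vector
    whose length is a power of two: [overall average, detail coefficients]."""
    coeffs, cur = [], list(freqs)
    while len(cur) > 1:
        avgs    = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        details = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        coeffs = details + coeffs  # coarser details go in front
        cur = avgs
    return cur + coeffs

a = [2, 4, 6, 8]  # per-value frequencies in one LSM component
b = [1, 1, 3, 3]  # per-value frequencies in another component

# Merging synopses = adding coefficients; identical to transforming the
# merged frequency vector directly.
merged = [x + y for x, y in zip(haar(a), haar(b))]
assert merged == haar([x + y for x, y in zip(a, b)])
```

The same linearity argument applies when combining synopses across the nodes of a partitioned cluster.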

(3) Optimizing the compaction process in AsterixDB. Here the goal is to optimize compaction in AsterixDB. For that, we proposed the novel K-slot merge policy, which has been proven to be optimally competitive (k-competitive) for SSTable stack compaction. Besides implementing the K-slot merge policy in AsterixDB, we also introduced the constant-prefix merge policy, a variant of AsterixDB's default prefix policy. This was necessary for a fair comparison, as the default prefix policy does not compact large disk components in order to avoid major compactions. For our experimental setup, we explored several benchmarks, including YCSB and the Big Data Benchmark. However, we found that these benchmarks are not well suited to AsterixDB, as they require optimized single-record insertions, which are currently being worked on in AsterixDB. We implemented REST clients for continuously querying AsterixDB, creating and connecting feeds to datasets, and using load statements for importing data from files. This work will be submitted for publication in the upcoming months.

(4) Supporting secondary attribute lookups in key-value store databases. NoSQL databases, such as key-value stores, achieve fast write throughput and fast lookups on the primary key. However, many applications also require queries on non-primary attributes. For that reason, several NoSQL databases have added support for secondary indexes. To the best of our knowledge, little work has studied how to support secondary indexing on pure key-value stores, which are a fundamental and popular category of NoSQL databases. We propose a novel lightweight secondary indexing technique for log-structured merge-tree (LSM-tree)-based key-value stores, which we refer to as the ``embedded index''. The embedded index, which utilizes Bloom filters, is very space-efficient and achieves very high write-throughput rates, comparable to those of non-indexed key-value stores. It is embedded inside the data files; that is, no separate index table is maintained. We implemented all indexing schemes on Google's popular open-source LevelDB key-value store. Our comprehensive experimental and theoretical evaluation reveals interesting trade-offs between the indexing strategies in terms of read and write throughput. A key result is that the embedded index is superior for high write-throughput requirements. We created and published a realistic Twitter-style read/write workload generator to facilitate our experiments, and we also published our index implementations on LevelDB as open source. This work has been submitted for publication.
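
A sketch of the embedded-index idea under a deliberately simplified file layout: each data file carries a Bloom filter over the secondary-attribute values it contains, so a secondary lookup only scans files whose filter matches. (Class and function names here are illustrative, not LevelDB's API.)

```python
import hashlib

class Bloom:
    """A tiny Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _hashes(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m
    def add(self, key):
        for h in self._hashes(key):
            self.bits |= 1 << h
    def might_contain(self, key):
        return all(self.bits >> h & 1 for h in self._hashes(key))

# Build "data files": each holds (primary key -> secondary value) records
# plus an embedded Bloom filter over its secondary values.
files = []
for records in ([("k1", "alice"), ("k2", "bob")], [("k3", "carol")]):
    bf = Bloom()
    for _, attr in records:
        bf.add(attr)
    files.append((bf, dict(records)))

def lookup_by_secondary(value):
    hits = []
    for bf, data in files:
        if bf.might_contain(value):  # otherwise the file is skipped entirely
            hits.extend(k for k, v in data.items() if v == value)
    return hits
```

Because the filters live inside the data files, writes pay almost nothing extra, which is why this design favors write-heavy workloads; a Bloom false positive only costs an extra file scan, never a wrong answer.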

Furthermore, we worked on the following problems:

(5) Accelerating in-memory database operations on FPGAs. Traditional CPU architectures introduce large cache hierarchies to cope with high RAM access latencies. This approach is very effective when data accesses exhibit locality (temporal and/or spatial). However, many important hash-based database algorithms (such as join and group-by aggregation) suffer from irregular data access. We suggest that FPGAs allow an alternative approach that hides the lack of locality through hardware multithreading. We prototyped FPGA-accelerated versions of hash-based join and aggregation algorithms and compared their performance with the best cache-optimized software counterparts. Our comparison showed that the FPGA multithreading approach achieves significant speedups, even on the current generation of FPGAs (which are clocked roughly an order of magnitude slower than a regular CPU). This work was published in CIDR'15.

(6) A study of the demographics of online health social outlets. The rapid spread of online social media has changed how patients share health-related information; however, little work has studied the demographics of the users involved. We study the demographics of users who participate in health-related online social outlets to identify possible links to healthcare disparities. We analyze and compare three different types of health-related social outlets: (i) the general online social networks Twitter and Google+; (ii) drug review websites; and (iii) health web forums. We focus on the following demographic attributes: age, gender, ethnicity, location, and writing level. We build and evaluate domain-specific classifiers to infer missing data where possible. The estimated demographic statistics are compared against various baselines, such as Internet and social network usage of the population. We identified interesting and actionable disparities in the participation of various demographic groups in various types of health-related social outlets; these disparities are significantly distinct from the disparities in Internet usage or general social outlet participation. This paper was accepted for publication in the Journal of Medical Internet Research.

Research Activities – Year 2

During the second year of this collaborative project, the UCR team worked on the following:

·        Conducted research and development related to BAD Asterix, the data serving component of the BAD system.

·        Collaborated with UCI colleagues on design discussions related to the BAD data broker network, the information dissemination component of the BAD system.

·        Designed and worked on implementing two sample BAD applications and deploying BAD and the applications on a 6-node cluster. 

In particular:

(1) We have implemented a working version of repetitive channels (Aim 1) and are currently experimenting with its scalability. Our overall vision for the BAD system appears in [6], where we detail a scalable system that can continuously and reliably capture Big Data to enable timely and automatic delivery of new information to a large pool of interested users, while also supporting analyses of historical information. This first paper zooms in on the Data Serving piece of the BAD puzzle, including its key concepts and user model.

(2) We considered the problem of applying diversification (Aim 2) to the posts arriving from a collection of microblog streams [7]. Today, users conveniently consume content by subscribing to content generators such as Twitter accounts or news agencies. However, given the number of subscriptions and the rate of the subscription streams, users suffer from information overload. To address this issue, we proposed a novel and flexible diversification paradigm to prune redundant posts from a collection of streams. A key novelty of our diversification model is that it holistically incorporates three important dimensions of social posts, namely content, time, and author. Different applications, such as microblogging, news, or bibliographic services, require different settings for these three dimensions. Further, each dimension poses unique performance challenges toward scaling the diversification model to many users and many high-throughput streams. In [7] we showed that hash-based content distance measures and graph-based author distance measures are both effective and efficient for social posts. Further, we proposed scalable real-time stream processing algorithms, leveraging efficient indexes, that take a social post stream as input and output a version of the stream diversified across all three dimensions. These techniques can be extended to serve multiple users by appropriately reusing indexing and computation where possible.
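
A stripped-down sketch of the pruning idea, using plain Jaccard token overlap as a stand-in for the hash-based content distance of [7] (the author dimension and the scalable indexes are omitted; the threshold and window values are arbitrary):

```python
def jaccard(a, b):
    """Token-set similarity between two posts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def diversify(stream, sim_threshold=0.5, window=60):
    """Emit a post only if no post kept within the last `window` seconds
    is too similar to it; stream items are (timestamp, author, text)."""
    kept = []
    for ts, author, text in stream:
        recent = [p for p in kept if ts - p[0] <= window]
        if all(jaccard(text, p[2]) < sim_threshold for p in recent):
            kept.append((ts, author, text))
    return kept

posts = [(0,   "a", "big earthquake downtown"),
         (10,  "b", "big earthquake downtown now"),  # near-duplicate: pruned
         (200, "c", "big earthquake downtown")]      # outside window: kept
out = diversify(posts)
```

The time dimension enters through the sliding window: the same content is allowed to reappear once enough time has passed, matching the intuition that an old story resurfacing is again newsworthy.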

Moreover, we also worked on the following problems:

In [8] we examined indexing in high-dimensional domains and presented DiVA, a multi-dimensional indexing approach that combines the selective use of an approximation approach with an indexing mechanism to organize data subspaces in a high fan-out hierarchical structure. Moreover, DiVA reorganizes its own elements after receiving application hints regarding data access patterns. These hints, or policies, trigger the restructuring and possible expansion of DiVA so as to offer finer indexing granularity and improved access times in subspaces emerging as ‘hot spots’. The novelty of our approach lies in the self-organizing nature of DiVA, driven by application-provided policies; the latter effectively guide the refinement of DiVA's elements as new data arrive, existing data are updated, and the nature of query workloads continually changes. An extensive experimental evaluation using real data shows that DiVA reduces the total number of I/Os by up to 64% compared with state-of-the-art methods, including the VA-file, the GC-tree, and the A-tree.

In [9] we demoed a framework we built to perform spatiotemporal investigative search over social media posts. Most research on analyzing social data has focused on detecting trends and patterns such as bursty topics and popular spatiotemporal paths, event extraction, studying how information spreads, or analyzing properties of the social graph. In contrast, and given the spatiotemporal nature of many of these social posts, we are interested in searching social network data items, posts and images, based on spatiotemporal keyword queries. In contrast to the trending queries studied by previous work, investigative search can be viewed as exploring the currently untapped long tail of the distribution of topics in social networks. This work led to the creation of OSNI (Online Social Network Investigator), a scalable distributed system to search social network data based on a spatiotemporal window and a list of keywords.

Our work in [10] presents a study that analyzed the content of Web-based health-related social media based on users' demographics so as to identify which health topics are discussed in which social media by which demographic groups and to help guide educational and research activities. In particular we analyzed three different types of health-related social media: (1) general Web-based social networks Twitter and Google+; (2) drug review websites; and (3) health Web forums, with a total of about 6 million users and 20 million posts. We analyzed the content of these posts based on the demographic group of their authors, in terms of sentiment and emotion, top distinctive terms, and top medical concepts. The results of this study identified the dominant topics for various categories of users of these sites, based on sex, age, geographical location as well as education.

Finally, in [11] we looked at health provider portals and performed a correlation analysis to see whether traditional quality indicators (medical school, graduation year, procedures, fellowships, patient reviews, location, and technology usage) are associated with a provider's quality (as indicated by the referral frequency of a provider and a peer-nominated quality designation). Our data-driven analysis identified several attributes that correlate with, and are discriminative of, referral volume and peer-nominated awards. In particular, our results demonstrated that these attributes vary by locality and that the frequency of an attribute is more important than its value (e.g., the number of patient reviews or hospital affiliations is more important than the average review rating or the ranking of the hospital affiliations, respectively). We demonstrated that it is possible to build accurate classifiers for referral frequency and quality designation, with accuracies over 85%.

Research Activities – Year 3

During the third year of this collaborative project, the UCR team worked on the following:

·     Continued research and development related to BAD Asterix, the data serving component of the BAD system.

·     Collaborated with UCI colleagues on the design of the BAD data broker network, the information dissemination component of the BAD system.

·     Completed the implementation of two sample BAD applications and deployed BAD and the applications on a cluster.

·     Worked on an efficient system to collect statistics in workloads with rapid data ingestion.

·     Implemented a scalable parallel processor for JSON queries on top of Apache VXQuery.

·     Worked on an efficient storage system for publish/subscribe.

In particular:

We continued working on the implementation of repetitive channels for our Big Active Data (BAD) system. To show the effectiveness of the system, we created two end-to-end applications. The first application mimics an "emergency notification system" where users can receive information about emergencies (for example, earthquakes, floods, shootings, etc.) via their subscriptions. Emergency reports (containing useful information related to emergency situations, including temporal and spatial attributes) may be published by agencies. Notifications to users can be enhanced with additional data such as nearby shelters and their locations. The second is a Twitter-based mobile application that allows a user to subscribe to her Twitter friends (i.e., followers and/or followees); if the user, a subset of her friends, and a coffee shop are in close proximity, the application notifies the user that they can meet in that coffee shop. We obtain the latest location of all users and their friends in real time as they post geo-tagged tweets. The BAD subscription channel checks for this condition periodically (e.g., every 5 seconds) and notifies the broker once a meetup is possible. Both scenarios were demoed in [12]. Further, we were invited to present our emergency demo at the DHS-sponsored Workshop on Real-time Analytics in Multi-latency, Multi-Party, Metro-scale Networks [13].
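
The meetup condition the channel checks periodically can be sketched as a simple proximity test (planar distance and all names here are illustrative; the real application evaluates spatial predicates over geo-tagged tweets inside the BAD data cluster):

```python
from math import hypot

def near(p, q, radius):
    return hypot(p[0] - q[0], p[1] - q[1]) <= radius

def meetup_candidates(user_loc, friend_locs, shops, radius=1.0):
    """One periodic evaluation: coffee shops close to the user AND to at
    least one of the user's friends, based on their latest locations."""
    matches = []
    for shop in shops:
        if near(user_loc, shop["loc"], radius):
            close_friends = [f for f, loc in friend_locs.items()
                             if near(loc, shop["loc"], radius)]
            if close_friends:
                matches.append((shop["name"], close_friends))
    return matches

user = (0.0, 0.0)
friends = {"dana": (0.5, 0.5), "eli": (5.0, 5.0)}
shops = [{"name": "Shop A", "loc": (0.2, 0.2)},
         {"name": "Shop B", "loc": (5.0, 5.0)}]
out = meetup_candidates(user, friends, shops)
# only Shop A qualifies, with dana nearby
```

In the BAD deployment this predicate runs as part of the channel every few seconds, and a non-empty result triggers a notification to the broker.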

One challenge for a Big Active Data system is keeping up with the influx of incoming data while providing useful analytics to the users. A popular solution to handle rapidly incoming data is to rely on log-structured merge (LSM) storage models. LSM-based systems provide a tunable trade-off between ingesting vast amounts of data at a high rate and running efficient analytical queries on top of that data. In [14] we address the problem of computing data statistics in workloads with rapid data ingestion and propose a lightweight statistics-collection framework that exploits the properties of LSM storage. To the best of our knowledge, this is the first design tailored specifically to work in LSM-based systems. Our approach is designed to piggyback on the events (flush and merge) of the LSM lifecycle; this allows us to easily create an initial statistics collection and then keep it in sync with rapidly changing data while minimizing the overhead to the existing system. We have implemented and adapted well-known algorithms to produce various types of statistical synopses, including equi-width histograms, equi-height histograms, and wavelets. We performed an in-depth empirical evaluation that considers both the cardinality estimation accuracy and runtime overheads of collecting and using statistics. 
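
For intuition, an equi-width histogram is one synopsis that fits this lifecycle naturally: it is built in a single pass while a component is flushed, and two histograms with the same bucket boundaries merge by bucket-wise addition during a component merge. A minimal sketch (the framework's actual interfaces differ):

```python
class EquiWidthHist:
    """Equi-width histogram over a fixed numeric domain [lo, hi)."""
    def __init__(self, lo, hi, nbuckets):
        self.lo, self.width = lo, (hi - lo) / nbuckets
        self.counts = [0] * nbuckets

    def add(self, v):                 # called while a component is flushed
        i = min(int((v - self.lo) / self.width), len(self.counts) - 1)
        self.counts[i] += 1

    def merge(self, other):           # piggybacks on the LSM merge event
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]

    def estimate_le(self, v):         # cardinality estimate for `col <= v`
        full = int((v - self.lo) / self.width)
        if full >= len(self.counts):
            return sum(self.counts)
        frac = (v - self.lo) / self.width - full
        return sum(self.counts[:full]) + frac * self.counts[full]

h1 = EquiWidthHist(0, 100, 10)        # synopsis of one flushed component
for v in (5, 15, 25):
    h1.add(v)
h2 = EquiWidthHist(0, 100, 10)        # synopsis of another component
for v in (5, 35):
    h2.add(v)
h1.merge(h2)                          # no data rescan needed
```

The `estimate_le` step is where the optimizer consumes the synopsis: it assumes values are uniformly spread within a bucket, the usual equi-width approximation.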

We also worked on creating a scalable processor for JSON data. The increased interest in JSON arises from its wide use, especially in IoT data streams. Although JSON is a simple data exchange format, querying it is not always effective, especially over large data repositories. Instead of building a JSON processor from scratch, we integrated the JSONiq extension of the XQuery language specification into an existing query processor (Apache VXQuery) to enable it to query JSON data in parallel. VXQuery was built (by our group) on top of Hyracks (a framework that generates parallel jobs) and Algebricks (a language-agnostic query algebra toolbox) and can process data on the fly, in contrast to other well-known systems, which need to load data first; thus, the extra cost of data loading is eliminated. In [15] we implemented three categories of rewrite rules that exploit the features of the above platforms to efficiently handle path expressions, along with introducing intra-query parallelism. We evaluated our implementation using a large (192GB) dataset of sensor readings. Our results show that the proposed rewrite rules lead to efficient and scalable parallel processing of JSON data.

Further, we built an efficient storage system for publish/subscribe. Publish/subscribe systems allow subscribers to monitor for events of interest generated by publishers. Current publish/subscribe query systems are efficient when the subscriptions (queries) are relatively static – for instance, the set of followers in Twitter – or can fit in memory. However, an increasing number of applications in this era of Big Data and the Internet of Things (IoT) are based on a highly dynamic query paradigm, where continuous queries number in the millions and are created and expire at a rate comparable to, or even higher than, that of the data (event) entries. For instance, moving objects like airplanes, cars, or sensors may continuously generate measurement data like air pressure or traffic, which are consumed by other moving objects. In [16] we propose and compare a novel publish/subscribe storage architecture, DualDB, based on the popular NoSQL Log-Structured Merge Tree (LSM) storage paradigm, to support high-throughput and dynamic publish/subscribe systems. Our method naturally supports queries on both past and future data and generates instant notifications, which are desirable properties missing from many previous systems. We implemented and experimentally evaluated our methods on the popular LSM-based LevelDB system, using real datasets. Our results show that we can achieve significantly higher throughput compared to state-of-the-art baselines.
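
The dual-store idea behind this design can be sketched as follows: subscriptions and events both live in keyed stores, so a new subscription can be answered from past events while a new event instantly notifies stored subscriptions. (A toy equality-match version; DualDB itself layers both stores on LSM storage for throughput.)

```python
class TinyPubSub:
    def __init__(self):
        self.events = []   # stored (published) events
        self.subs = {}     # (attr, value) -> set of subscribed users
        self.inbox = {}    # user -> delivered events

    def subscribe(self, user, attr, value):
        self.subs.setdefault((attr, value), set()).add(user)
        # past data: backfill the new subscription from stored events
        self.inbox.setdefault(user, []).extend(
            e for e in self.events if e.get(attr) == value)

    def publish(self, event):
        self.events.append(event)
        # future data: instant notification of matching subscribers
        for (attr, value), users in self.subs.items():
            if event.get(attr) == value:
                for u in users:
                    self.inbox.setdefault(u, []).append(event)

ps = TinyPubSub()
ps.publish({"topic": "traffic", "road": "I-5"})    # before subscribing
ps.subscribe("u1", "topic", "traffic")             # backfilled from the past
ps.publish({"topic": "traffic", "road": "CA-60"})  # instant notification
```

Treating subscriptions as just another keyed dataset is what lets a highly dynamic query population be ingested at the same rate as the events themselves.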

Research Activities – Year 4

During the fourth year of this grant, we performed the following tasks:

·    We continued the research and development related to BAD Asterix, the data serving component of the BAD system.

·    Continued the collaboration with the UCI colleagues on the design of the BAD data broker network, the information dissemination component of the BAD system.

·    Completed the work on collecting statistics in workloads with rapid data ingestion.

·    Completed the design of a high-throughput publish/subscribe system on top of LSM-based storage.

·    Performed a comparative study of secondary indexing techniques in LSM-based NoSQL databases.

In particular, our efforts are described below:

1. We continued working on the implementation of repetitive channels for our Big Active Data (BAD) system. In particular, we performed a performance evaluation of the BAD system in comparison with passive Big Data management systems. We used the Opportunistic Network Environment Simulator to create movement traces of many thousands of users on the road network of Helsinki, as well as to generate simulated emergencies. We compared the BAD model, in which users subscribe to emergencies near their locations, with a passive polling model based on current big data management systems. Our experiments showed that BAD performs up to two orders of magnitude faster than the passive approach. We also performed scale-out and scale-up experiments on a large cluster; our results show that BAD scales very well. This work resulted in a paper under review [17]. Furthermore, we enhanced the system by implementing subscription sharing: we redesigned subscription management to capitalize on common subscriptions and provided an initial performance evaluation of this shared subscription model, which showed up to 6 times better performance. In addition to the pull-based model, we implemented a push-based channel model (where results are sent directly to brokers rather than being staged and fetched later), including an initial performance comparison between push-based and pull-based channels. Finally, we implemented recovery changes that allow channels to continue execution in the presence of cluster failures, as well as to adapt to the creation of new indexes. This work led to a PhD thesis [18].
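
The gain from subscription sharing comes from evaluating the channel once per distinct parameter binding rather than once per subscriber. A minimal sketch of the grouping step (all names are illustrative):

```python
def evaluate_shared(channel_fn, data, subscriptions):
    """subscriptions: (user, params) pairs, params a hashable tuple.
    Runs channel_fn once per *distinct* params and fans the result out."""
    groups = {}
    for user, params in subscriptions:
        groups.setdefault(params, []).append(user)
    results, evaluations = {}, 0
    for params, users in groups.items():
        evaluations += 1
        shared = channel_fn(data, *params)  # one evaluation, many users
        for u in users:
            results[u] = shared
    return results, evaluations

# Four subscribers but only two distinct parameter bindings:
magnitude_above = lambda quakes, threshold: [q for q in quakes if q >= threshold]
subs = [("a", (5,)), ("b", (5,)), ("c", (7,)), ("d", (5,))]
results, evaluations = evaluate_shared(magnitude_above, [4, 6, 8], subs)
# evaluations == 2, not 4; users a, b, d share one computed result
```

When many users submit identical parameter values, the number of channel executions grows with the number of distinct bindings, not the number of subscribers, which is the source of the reported speedup.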

2. We completed the work on collecting statistics in a big data management system under a workload with rapid data ingestion. We considered two problems: collecting statistics on indexed columns (where the stream of records is ordered on the index key) and on non-indexed (unordered) columns. For each case, appropriate statistical summaries are chosen so that the overall overhead on the system's critical path remains low, thus not affecting the ingestion process. This work led to a PhD thesis [19].

3. We continued our work on building a high-throughput publish/subscribe system on top of LSM-based storage. State-of-the-art publish/subscribe systems are efficient when the subscriptions are relatively static – for instance, the set of followers in Twitter – or can fit in memory. However, nowadays many Big Data and IoT-based applications follow a highly dynamic query paradigm, where both continuous queries and data entries are in the millions and can arrive and expire rapidly. We propose and compare several publish/subscribe storage architectures, based on the popular NoSQL Log-Structured Merge Tree (LSM) storage paradigm, to support high-throughput and highly dynamic publish/subscribe systems. Our framework naturally supports subscriptions on both historic and future streaming data and generates instant notifications. Further, we show how hierarchical attributes, such as concept ontologies, can be efficiently supported. We implemented and experimentally evaluated our methods on the popular LSM-based LevelDB system, using real datasets, for simple match and self-joining subscriptions on both flat and hierarchical attributes. Our approaches achieve significantly higher throughput compared to state-of-the-art publish/subscribe systems. This work led to a journal paper [20].

4. We performed a comparative study of secondary indexing techniques in LSM-based NoSQL databases. NoSQL databases are increasingly used in big data applications because they achieve fast write throughput and fast lookups on the primary key. Many of these applications also require queries on non-primary attributes, and for that reason several NoSQL databases have added support for secondary indexes. However, as no single system supports all types of secondary indexes, there has been no experimental head-to-head comparison or performance analysis of the various secondary indexing techniques in terms of throughput and space. We present a taxonomy of NoSQL secondary indexes, broadly split into two classes: Embedded Indexes (i.e., lightweight filters embedded inside the primary table) and Stand-Alone Indexes (i.e., separate data structures). To ensure fairness, we built a system, LevelDB++, on top of Google's popular open-source LevelDB key-value store, in which we implemented two embedded indexes and three state-of-the-art stand-alone indexes that cover most of the popular NoSQL databases. Our comprehensive experimental study and theoretical evaluation show that none of these indexing techniques dominates the others: the embedded indexes offer superior write throughput and are more space efficient, whereas the stand-alone secondary indexes achieve faster query response times. Thus, the optimal choice of secondary index depends on the application workload. This work appears in [21] and led to a PhD thesis [22].

Research Activities – Year 5

In collaboration with our colleagues at UCI, we developed a holistic Big Active Data (BAD) Platform. Outside the platform are data sources (Data Publishers) and end users (Data Subscribers). The platform’s components provide two broad areas of functionality -- Big Data monitoring and management (handled by the BAD Data Cluster, the lower half of the platform) and user notification management and distribution (handled by the BAD Broker Network, the upper half of the platform). In particular, the efforts of the UCR team were within the Big Data monitoring and management part.

A fully functional prototype of the scalable BAD Data Cluster was built, by extending the Apache AsterixDB Big Data Management System (BDMS). Within this effort, during the fifth year of this project we collaborated with our UCI colleagues on:

1. The development of a user model and accompanying DDL for Big Active Data based on a notion of subscribable, parameterized, declarative (i.e., query-based) data channels that deliver data of interest to a potentially large collection of interested users (channel subscribers).

2. The development of an Active Toolkit that extends AsterixDB with additional capabilities needed to support BAD at the BAD Data Cluster level, including: (i) Deployed jobs, which can perform arbitrary SQL++ tasks; they are compiled and distributed once and then used multiple times for efficiency. (ii) Data channels, which actively process data with respect to subscriber interests; a single channel is compiled once and then executed on behalf of a scalable number of users, yet produces individualized, staged results. (iii) Procedures, which can use deployed jobs to perform other active management tasks regularly and efficiently.

3. The development of a handful of simple example applications to drive and to verify the BAD platform’s functionality and usability, including a hypothetical emergency response application (packaged as a demo/game that multiple users can participate in), a nearby Starbucks friend-finding application, and a Craigslist-inspired classified advertising application.
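
The channel concept in items 1 and 2 can be illustrated with a small sketch. This is a hypothetical Python model, not the actual system (the real platform expresses channels declaratively in SQL++ inside AsterixDB): the channel logic is defined once, and a single evaluation pass over newly arrived data stages individualized results for every subscriber. All names and the data shape are illustrative.

```python
# Hypothetical model of a parameterized data channel: defined once,
# evaluated in one shared pass over new data, producing individualized
# staged results per subscriber. Names and record shape are made up.

subscribers = {}   # subscription id -> the subscriber's channel parameter

def subscribe(sub_id, region):
    subscribers[sub_id] = region

def evaluate_channel(new_reports):
    """One shared scan of the new-data batch; results staged per subscriber."""
    staged = {sid: [] for sid in subscribers}
    for report in new_reports:              # scan the batch once
        for sid, region in subscribers.items():
            if report["region"] == region:  # match the subscriber's parameter
                staged[sid].append(report)
    return staged

subscribe("s1", "north")
subscribe("s2", "south")
results = evaluate_channel([{"region": "north", "msg": "flood"}])
```

The point of the sketch is the sharing: the batch of new data is scanned once regardless of how many subscribers exist, yet each subscriber receives only the results matching its own parameter.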

Other major activities include:

5. We conducted a thorough experimental evaluation of bounded-depth LSM merge policies [23]. In BAD we use log-structured merge (LSM) storage to support the high write throughput needed to efficiently handle the high rate of incoming updates. In this work we considered various merge policies for disk components (SSTables); the experimental evaluation showed the advantages of the Binomial merge policy, which has since been added to the AsterixDB code base.
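
As a rough illustration of the bounded-depth family (not the Binomial policy itself, whose merge schedule is more subtle), the sketch below merges the cheapest adjacent pair of disk components whenever a flush would exceed a depth bound k, so a read never has to consult more than k components. Sizes and the bound are hypothetical.

```python
# Illustrative bounded-depth LSM merge policy (one simple member of the
# family, not the Binomial policy): keep at most k on-disk components by
# merging the adjacent pair with the smallest combined size after a flush.

def flush(components, new_size, k=4):
    """components: sizes of on-disk components, newest last."""
    comps = components + [new_size]
    while len(comps) > k:
        # pick the adjacent pair with the smallest combined size and merge it
        i = min(range(len(comps) - 1), key=lambda j: comps[j] + comps[j + 1])
        comps[i:i + 2] = [comps[i] + comps[i + 1]]
    return comps
```

Bounding the number of components caps read amplification; which components to merge (and hence the write amplification paid) is exactly what policies such as Binomial optimize.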

6. We completed an experimental evaluation of scalable algorithms for supporting temporal interval joins. In particular, we adapted five recently published state-of-the-art overlapping interval join algorithms and modified them to work in a shared-nothing big data management system (AsterixDB) under a memory budget. We developed a cost model for each algorithm to predict when it will spill to disk (run out of memory). Our experimental evaluation shows that the cost models are accurate and can be used to pick the most efficient algorithm for the given input data. This work led to a Ph.D. thesis [24].
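
For intuition, here is a minimal single-machine sketch of an overlapping interval join in the sort-merge style. It is not any of the five evaluated algorithms verbatim, and it omits the memory-budget and spilling machinery that the cost models reason about; inputs and names are illustrative.

```python
# Minimal sort-merge-style overlapping interval join. Intervals are
# (id, start, end) with closed endpoints; a pair joins when they overlap.

def interval_join(r, s):
    r = sorted(r, key=lambda t: t[1])   # sort both inputs by start point
    s = sorted(s, key=lambda t: t[1])
    out = []
    for rid, rs, re in r:
        for sid, ss, se in s:
            if ss > re:                 # s starts after r ends: no later
                break                   # s interval can match either
            if se >= rs:                # overlap test for closed intervals
                out.append((rid, sid))
    return out
```

The early `break` is what sorting buys: once an s interval starts past the current r interval's end, all remaining s intervals do too.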

7. We conducted a comparison of synopsis techniques for approximate query answering over big spatial data. In particular, we experimentally studied four spatial data synopsis techniques on three common data analysis problems, namely selectivity estimation, k-means clustering, and spatial partitioning. We ran an extensive experimental evaluation on both real and synthetic datasets of up to 2.7 billion records to study the trade-offs among the synopsis methods and their applicability to big spatial data analysis [25].
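
As a small example of one synopsis style (a uniform grid histogram; the study compares four techniques), the sketch below estimates the selectivity of a rectangular range query over points in the unit square by summing the counts of all grid cells the rectangle touches. Grid resolution and data are illustrative.

```python
# Illustrative uniform grid histogram synopsis for selectivity estimation
# over 2D points in the unit square.

def build_grid(points, n=4):
    grid = [[0] * n for _ in range(n)]
    for x, y in points:
        grid[min(int(y * n), n - 1)][min(int(x * n), n - 1)] += 1
    return grid

def estimate(grid, x1, y1, x2, y2, n=4):
    """Sum the counts of every cell that intersects the query rectangle
    (a coarse estimate: partially covered cells are counted in full)."""
    total = 0
    for r in range(n):
        for c in range(n):
            cx1, cy1 = c / n, r / n            # cell's lower-left corner
            cx2, cy2 = (c + 1) / n, (r + 1) / n
            if cx1 < x2 and cx2 > x1 and cy1 < y2 and cy2 > y1:
                total += grid[r][c]
    return total
```

The synopsis is tiny (n*n counters) regardless of dataset size, which is the property that makes such summaries attractive at the billion-record scale studied in [25].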

10. We examined how best to implement hash joins and hash-based group-by aggregation using hardware accelerators (FPGAs). A major bottleneck in in-memory analytics is high memory access latency. Traditionally, this problem is addressed with large cache hierarchies, which benefit only applications with regular access patterns; many data-intensive applications instead exhibit irregular behavior. We use massive hardware multithreading to cope with the high latency seen in such applications. Our FPGA implementations of hash join and hash-based group-by aggregation showed much higher throughput than the best known multicore CPU implementations [26].
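
The algorithmic skeleton underlying this work is the classic build-then-probe hash join shown below; the FPGA design's contribution is hiding memory latency with massive hardware multithreading, which this sequential Python sketch deliberately does not model.

```python
# Classic two-phase hash join (build, then probe). The irregular, cache-
# unfriendly accesses are the hash-table lookups in the probe phase; the
# FPGA design overlaps many such lookups in flight, which a sequential
# sketch cannot show.

def hash_join(build_side, probe_side, key_b, key_p):
    table = {}
    for row in build_side:               # build phase: hash the smaller input
        table.setdefault(row[key_b], []).append(row)
    out = []
    for row in probe_side:               # probe phase: look up each row's key
        for match in table.get(row[key_p], []):
            out.append({**match, **row})
    return out
```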

11. We completed a study of the Top-K and Skyline selection operators in the context of emerging parallel architectures and developed work-efficient algorithms suitable for parallel main-memory processing. We concentrated on multi-core (CPU), many-core (GPU), and processing-in-memory (PIM) architectures, developing solutions optimized for high throughput and low latency [27], [28], [29]. This work led to a Ph.D. thesis [30].
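
For reference, a minimal single-threaded Top-K selection over a bounded min-heap is sketched below; the published algorithms parallelize and heavily optimize this baseline for CPUs, GPUs, and PIM, including early-stopping techniques.

```python
# Baseline Top-K selection: keep the k largest values seen so far in a
# min-heap of size k, so each element costs O(log k) at worst.
import heapq

def top_k(scores, k):
    heap = []                      # min-heap holding the k largest so far
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:          # beats the current k-th largest
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)
```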


[1] Mohiuddin Abdul Qader, Shiwen Cheng, Abhinand Menon, Vagelis Hristidis: “Efficient Secondary Attribute Lookup in NoSQL Databases”; submitted for publication.

[2] Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras: “FPGA-based Multithreading for In-Memory Hash Joins”, Seventh Biennial Conference on Innovative Data Systems Research (CIDR), 2015, link

[3] Roger Moussalli, Ildar Absalyamov, Marcos R. Vieira, Walid A. Najjar, Vassilis J. Tsotras: “High performance FPGA and GPU complex pattern matching over spatio­temporal streams”, Geoinformatica. Vol. 19, No. 2, 2015, link 

[4] Shouq A Sadah, Moloud Shahbazi, Matthew T Wiley, Vagelis Hristidis: “A Study of the Demographics of Online Health Social Outlets”, Journal of Medical Internet Research, 2015, link

[5] V. Borkar, Y. Bu, E. Carman, N. Onose, T. Westmann, P. Pirzadeh, M. Carey, and V.J. Tsotras, “Algebricks: A Data Model-Agnostic Compiler Backend for Big Data Languages”, ACM Symp. on Cloud Computing (SoCC), pp. 422-433, 2015, link

[6] Michael J. Carey, Steven Jacobs, and Vassilis J. Tsotras. 2016. “Breaking BAD: a Data Serving Vision for Big Active Data”, Proc. of the 10th ACM Int’l. Conf. on Distributed and Event-Based Systems (DEBS '16), Irvine, CA, June 2016, link

[7] Shiwen Cheng, Marek Chrobak, and Vagelis Hristidis. “Slowing the Firehose: Multi-Dimensional Diversity on Social Post Streams”, Proc. of the Int’l. Conf. on Extending Database Technology (EDBT ‘16), Bordeaux, France, March 2016, link

[8] Konstantinos Tsakalozos, Spiros Evangelatos, Fotis Psallidas, Marcos R. Vieira, Vassilis J. Tsotras and Alex Delis, “DiVA: Using Application-Specific Policies to ‘Dive’ into Vector Approximations”, The Computer Journal (2016) 59 (9): 1363-1382, link

[9] Shiwen Cheng, James Fang, Vagelis Hristidis, Harsha V. Madhyastha, Niluthpol Chowdhury Mithun, Dorian Perkins, Amit K. Roy-Chowdhury, Moloud Shahbazi, Vassilis J. Tsotras, "OSNI: Searching for Needles in a Haystack of Social Network Data", Proc. of the Int’l. Conf. on Extending Database Technology (EDBT ‘16), Bordeaux, France, March 2016, pp 616-619, link

[10] Shouq A. Sadah, Moloud Shahbazi, Matthew T. Wiley, and Vagelis Hristidis, “Demographic-Based Content Analysis of Online Health-Related Social Media”, Journal of Medical Internet Research, Vol. 18, No. 6, June 2016, link

[11] Matthew T. Wiley, Ryan L. Rivas, and Vagelis Hristidis, “Provider Attributes Correlation Analysis to Their Referral Frequency and Awards”, BMC Health Services Research, Vol. 16, No. 1, March 2016, link

[12] Steven Jacobs, Md. Yusuf Sarwar Uddin, Michael J. Carey, Vagelis Hristidis, Vassilis J. Tsotras, Nalini Venkatasubramanian, Yao Wu, Syed Safir, Purvi Kaul, Xikui Wang, Mohiuddin Abdul Qader, Yawei Li: A BAD Demonstration: Towards Big Active Data. PVLDB 10(12): 1941-1944 (2017)

[13] Steven Jacobs, Md Yusuf Sarwar Uddin, Purvi Kaul, Syed Safir, Xikui Wang, Yao Wu, Waleed Amjad, Michael Carey, Vassilis Tsotras, Nalini Venkatasubramanian (2017). Towards Situational Awareness on Big Data: A BAD Approach. RAMMMNets 2017: Workshop on Real-time Analytics in Multi-latency, Multi-Party, Metro-scale Networks. ICDE Workshop. 

[14] Ildar Absalyamov, Michael J. Carey, Vassilis J. Tsotras; Lightweight Cardinality Estimation in LSM-based Systems. ACM SIGMOD Conference 2018.

[15] Christina Pavlopoulou, E. Preston Carman, Jr., Till Westmann, Michael J. Carey, Vassilis J. Tsotras (2017). A Parallel and Scalable Processor for JSON Data. EDBT 2018.

[16] Mohiuddin Abdul Qader and Vagelis Hristidis. DualDB: An Efficient LSM-based Publish-Subscribe Storage System. International Conference on Scientific and Statistical Database Management (SSDBM), short paper, 2017

[17] Steven Jacobs, Xikui Wang, Michael J. Carey, Vassilis J. Tsotras, Md Yusuf Sarwar Uddin, "BAD to the Bone: Big Active Data at its Core", under review.

[18] Steven Jacobs, "A BAD (Big Active Data) Thesis", PhD Thesis, Dept. of Computer Science, UC Riverside (defended Sept. 2018).

[19] Ildar Absalyamov, "Query Processing and Cardinality Estimation in Modern Database Systems", PhD Thesis, Dept. of Computer Science, UC Riverside (defended June 2018).

[20] Mohiuddin Abdul Qader, Vagelis Hristidis. High-throughput Publish/Subscribe on top of LSM-based Storage. Distributed and Parallel Databases (DAPD) Journal, 2019

[21] Mohiuddin Abdul Qader, Shiwen Cheng, Vagelis Hristidis. A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases. ACM SIGMOD International Conference on Management of Data (SIGMOD), 2018.

[22] Mohiuddin Abdul Qader, "I/O Optimization in Big Data Storage Systems", PhD Thesis, Dept. of Computer Science, UC Riverside (defended July 2018).

[23] Qizhong Mao, Steven Jacobs, Waleed Amjad, Vagelis Hristidis, Vassilis J. Tsotras, Neal E. Young, "Experimental Evaluation of Bounded-Depth LSM Merge Policies", 2019 IEEE International Conference on Big Data, December 9-12, 2019, Los Angeles, USA (to appear)

[24] Eldon Preston Carman, "Interval Joins for Big Data and a Scalable Parallel XQuery Processor", Ph.D. Thesis, Department of Computer Science and Engineering, UC Riverside, Dec. 2019.

[25] Muhammad Abu Bakar Siddique, Ahmed Eldawy, Vagelis Hristidis, "Comparing Synopsis Techniques for Approximate Spatial Data Analysis", Proceedings of the VLDB Endowment (PVLDB) 2019, Vol. 12, pp 1583-1596.

[26] Skyler Windh, Bashar Romanous, Ildar Absalyamov, Prerna Budhkar, Robert J. Halstead, Walid A. Najjar, Vassilis J. Tsotras, "Efficient Local Locking for Massively Multithreaded In-Memory Hash-based Operators", submitted for publication.

[27] Vasileios Zois, Vassilis J. Tsotras, Walid A. Najjar, "Efficient Main-Memory Top-K Selection For Multicore Architectures", Proceedings of the VLDB Endowment (PVLDB), Vol. 13(2): 114-127 (2019)

[28] Vasileios Zois, Divya Gupta, Vassilis J. Tsotras, Walid A. Najjar, Jean-François Roy, "Massively parallel skyline computation for processing-in-memory architectures", 27th International Conference on Parallel Architectures and Compilation Techniques, PACT 2018, pp 1:1-1:12, Limassol, Cyprus, Nov. 2018.

[29] Vasileios Zois, Vassilis J. Tsotras, Walid A. Najjar, "GPU Accelerated Top-K Selection With Efficient Early Stopping", Tenth International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS), 2019.

[30] Vasileios Zois, "Complex Query Operators on Modern Parallel Architectures", Ph.D. Thesis, Department of Computer Science and Engineering, UC Riverside, Oct. 2019.