"Collaborative Research: Making Big Data Active: From Petabytes to Megafolks in Milliseconds "

Funded by the National Science Foundation


PI: Vassilis J. Tsotras
Co-PI: Evangelos Christidis
Award Number: IIS-1447826

This is a collaborative project with Award IIS-1447720
PI: Mike Carey, University of California, Irvine
Co-PI: Nalini Venkatasubramanian, University of California, Irvine

Duration:  09/01/2014 through 08/31/2018

Students:
Ildar Absalyamov
Waleed Amjad
Steven Jacobs
Mohiuddin Qader
Shouq Sadah
Marcel Tawamba (REU Participant)


Project Summary:

A wealth of digital information is being generated daily through social networks, blogs, online communities, news sources, and mobile applications as well as device-based sources in an increasingly sensed world. Organizations and researchers in diverse domains recognize that tremendous value and insight can be gained by capturing this emerging data and making it available for querying and analysis. A number of domains, such as public health, national security, and public safety, could benefit greatly from timely awareness of situations as they first become observable in Big Data. There is a need to move beyond passive Big Data analytical infrastructures – Big Active Data is needed!

First-generation Big Data management efforts yielded MapReduce-based frameworks such as Hadoop, Pig, and Hive that address after-the-fact data analytics, key-value (“NoSQL”) stores that provide scalable, keyed record storage and retrieval, and a handful of specialized systems that target problems like parallel graph analytics and data stream analysis. This project aims to continuously and reliably capture Big Data arising from social, mobile, Web, and sensed data sources and to enable timely delivery of information to users with indicated interests. In a nutshell, we aim to develop techniques to enable the accumulation and monitoring of petabytes of data of potential interest to millions of end users; when “interesting” new data appears, it should be delivered to end users in a timeframe measured in (hundreds of) milliseconds.

Successful execution of this project will require the development of practical solutions to a number of interesting and important technical problems that we categorize in three Aims.

Aim 1 deals with scalable data input ("data in" problems). To offer declarative subscriptions, we envision the use of data channels to which users can subscribe. On the "data in" side, the technical challenges that we are addressing include resource management in very large scale LSM-based storage systems and the provision of a highly available and elastic facility for fast data ingestion.

Aim 2 addresses data channel and subscription management ("inside" problems). On the "inside", the challenges include the parallel evaluation of a large number of declarative data subscriptions over (multiple) highly partitioned data sets. Amplifying this challenge is the need to efficiently support spatial, temporal, and similarity predicates in data subscriptions. Big Data also makes result ranking and diversification techniques critical in order for large result sets to remain manageable.

Aim 3 focuses on scalable data delivery ("data out" problems). On the "data out" side, the technical challenges include the reliable and timely dissemination of data of interest to a sometimes-connected subscriber base of unprecedented scale.

The software basis for this work is AsterixDB, an open-source Big Data Management System (BDMS) that supports the scalable storage, searching, and analysis of mass quantities of semi-structured data. By leveraging AsterixDB, this project aims at the next step in big data management: creating an "active BDMS" and an open source prototype that will be made available to the Big Data research community.

The resulting techniques will be disseminated to the Big Data community through open source software sharing. The techniques are expected to be general-purpose and will inform and benefit other Big Data efforts as well. The project will help to train information management students, at our universities and elsewhere, in technologies related to Big Data management and applications; such training is critical to addressing the information explosion that social media and the mobile Web are driving today. This project can also benefit the broader public by laying a general-purpose foundation for active information dissemination from Big Data, e.g., for use in areas such as public safety and public health.

Research Activities – Year 1

During the first year of this collaborative project, the UCR team worked on the following:

(1) Parameterized Channels. We envision a Big Active Data user model where users form their subscriptions by utilizing a collection of predefined parameterized channels. A channel in our system is defined as a parameterized, query-based function that is registered with the system to be continuously monitored according to its specification. A user then subscribes to a channel by providing parameter values for the channel function; this provides scalability, since results for each user are created by joining the user parameters with the result dataset. In particular, we consider two kinds of channels, ‘repetitive’ and ‘continuous’. Repetitive channels compute their result at predefined temporal intervals (every minute, at a specific time of day, etc.) and always provide the complete query result. Continuous channels are evaluated continuously and provide incrementally new results (as data changes over time). During the first year we completed a working version for the repetitive-channel case. In order to build repetitive channels, the following were implemented: (i) a robust DDL that enables channels to be easily created and subscribed to using rich capabilities; (ii) a channel job, which is a long-running job living in the system and able to perform its work periodically; and (iii) underlying software to monitor and interact with the long-running jobs. A small sketch of the repetitive-channel idea appears below.
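
To make the channel model concrete, the following minimal Python sketch simulates a repetitive channel. It is illustrative only, assuming a hypothetical channel function and in-memory data; it is not AsterixDB's actual DDL, channel-job machinery, or storage layer. All names (nearby_reports, run_channel_once, deliver) are made up for this example.

    import time

    def nearby_reports(center, radius, reports):
        """Hypothetical parameterized channel function: reports within `radius` of `center`."""
        return [r for r in reports
                if abs(r["x"] - center[0]) + abs(r["y"] - center[1]) <= radius]

    # Each subscription is just a set of parameter values for the channel function.
    subscriptions = [
        ("alice", {"center": (0, 0), "radius": 5}),
        ("bob",   {"center": (9, 9), "radius": 2}),
    ]

    def deliver(user, result):
        print(user, "->", result)

    def run_channel_once(reports):
        """One periodic execution: evaluate the channel once per distinct parameter
        set, then fan results out to the matching subscribers (the 'join' step)."""
        results = {}
        for user, params in subscriptions:
            key = (params["center"], params["radius"])
            if key not in results:
                results[key] = nearby_reports(params["center"], params["radius"], reports)
            deliver(user, results[key])

    reports = [{"x": 1, "y": 2}, {"x": 8, "y": 9}]
    for _ in range(3):          # stand-in for a long-running channel job
        run_channel_once(reports)
        time.sleep(1)           # stand-in for the channel's period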

(2) Statistics and selectivity estimation in AsterixDB. Modern Big Data management systems (BDMSs) require revisiting well-known optimization techniques, which tend to fail in the new framework. The quality of the query execution plans produced by an optimizer largely depends on precise knowledge of statistical data distributions. However, most modern database systems rely on offline statistics collection, which is prohibitively expensive in the BDMS context since it requires rescanning all records in a large, distributed dataset. We lift this limitation by enabling lightweight online statistics collection. In particular, we collect statistics as data is inserted into the system using the LSM-based storage model (currently used by many BDMSs, including AsterixDB). Obtaining data statistics then piggybacks on common LSM-component lifecycle events (such as flush and merge) and thus does not require additional data scanning. While previous data management systems traditionally used histograms to store compressed representations of the underlying data, this proved to be very inefficient in our context, where statistics are incrementally assembled from smaller (LSM-component) pieces. We evaluated alternative data structures and concluded that Haar wavelet synopses enable an efficient merge operation, which is required both to assemble statistics from the various LSM components and to combine statistics from different nodes in a partitioned, distributed cluster setup. We have implemented algorithms for the initial creation and merging of Haar wavelet synopses and successfully integrated them into the AsterixDB LSM framework.
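
The reason wavelet synopses merge cheaply is the linearity of the Haar transform: the transform of a combined frequency histogram equals the sum of the per-component transforms, so merging reduces to coefficient-wise addition followed by re-thresholding to the synopsis budget. The sketch below illustrates this under simplifying assumptions (power-of-two histograms, un-normalized coefficients); it is not the AsterixDB code, and all names are illustrative.

    import numpy as np

    def haar(freq):
        """Un-normalized Haar decomposition of a power-of-two frequency vector."""
        coeffs = []
        v = np.asarray(freq, dtype=float)
        while len(v) > 1:
            avg = (v[0::2] + v[1::2]) / 2
            diff = (v[0::2] - v[1::2]) / 2
            coeffs = list(diff) + coeffs
            v = avg
        return np.array(list(v) + coeffs)

    def synopsis(freq, budget):
        """Keep only the `budget` largest-magnitude coefficients."""
        c = haar(freq)
        keep = np.argsort(np.abs(c))[-budget:]
        return {int(i): float(c[i]) for i in keep}

    def merge(s1, s2, budget):
        """Merge two synopses: add coefficients (linearity), then re-threshold."""
        merged = dict(s1)
        for i, v in s2.items():
            merged[i] = merged.get(i, 0.0) + v
        keep = sorted(merged, key=lambda i: abs(merged[i]))[-budget:]
        return {i: merged[i] for i in keep}

    # Local histograms of two LSM components over the same 8 buckets:
    h1 = [5, 7, 0, 0, 1, 1, 0, 2]
    h2 = [0, 1, 3, 3, 0, 0, 8, 0]
    print(merge(synopsis(h1, 4), synopsis(h2, 4), 4))   # approximates haar(h1 + h2)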

(3) Optimizing the compaction process in AsterixDB. The goal here is to optimize compaction in AsterixDB. To that end, we proposed the novel K-slot merge policy, which we have proven to be k-competitive, and optimally so, for SSTable stack compaction. Besides implementing the K-slot merge policy in AsterixDB, we also introduced the constant-prefix merge policy, a variant of AsterixDB's default prefix policy. This was necessary for a fair comparison, since the default prefix policy avoids major compactions by not compacting large disk components. For our experimental setup, we explored several benchmarks, including YCSB and the Big Data Benchmark. However, we found that these benchmarks are not suitable for AsterixDB as they require optimized unit insertions, which are currently being worked on in AsterixDB. We implemented REST clients for continuously querying AsterixDB, creating and connecting feeds to datasets, and using load statements for importing data from a file. This work will be submitted for publication in the upcoming months.
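
For readers unfamiliar with the setting, the sketch below illustrates the stack-compaction problem that merge policies address: disk components accumulate as flushes occur, only adjacent components may be merged, and the policy must bound the component count while limiting how many bytes get rewritten. The stand-in policy shown here (merge the cheapest adjacent pair whenever more than k components exist) is a deliberately naive illustration, not the published K-slot or constant-prefix policy.

    def flush_and_maybe_merge(components, new_size, k):
        """components: list of component sizes, newest last. Keep at most k components."""
        components = components + [new_size]
        while len(components) > k:
            # Naive choice: merge the adjacent pair with the smallest combined size,
            # so large, old components are rewritten as rarely as possible.
            i = min(range(len(components) - 1),
                    key=lambda j: components[j] + components[j + 1])
            merged = components[i] + components[i + 1]
            components = components[:i] + [merged] + components[i + 2:]
        return components

    stack = []
    for size in [10, 12, 8, 50, 9, 11, 7]:   # sizes of successive flushed components
        stack = flush_and_maybe_merge(stack, size, k=4)
        print(stack)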

(4) Supporting secondary-attribute lookups in key-value store databases. NoSQL databases, such as key-value stores, achieve fast write throughput and fast lookups on the primary key. However, many applications also require queries on non-primary attributes, so several NoSQL databases have added support for secondary indexes. To the best of our knowledge, little work has studied how to support secondary indexing on pure key-value stores, which are a fundamental and popular category of NoSQL databases. We propose a novel lightweight secondary indexing technique for log-structured merge-tree (LSM-tree) based key-value stores, which we refer to as the "embedded index". The embedded index, which utilizes Bloom filters, is very space-efficient and achieves very high write-throughput rates, comparable to non-indexed key-value stores. It is embedded inside the data files; that is, no separate index table is maintained. We implemented all indexing schemes on Google's popular open-source LevelDB key-value store. Our comprehensive experimental and theoretical evaluation reveals interesting trade-offs between the indexing strategies in terms of read and write throughput. A key result is that the embedded index is superior when high write throughput is required. We created and published a realistic Twitter-style read/write workload generator to facilitate our experiments, and we also published our index implementations on LevelDB as open source. This work has been submitted for publication.
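
The following Python sketch conveys the embedded-index idea under simplifying assumptions: each immutable data file keeps a small Bloom filter over the secondary-attribute values it contains, so a secondary lookup probes the filters and scans only the files that may hold matching records. It is not the LevelDB implementation; the Bloom filter parameters and all class and function names are illustrative.

    import hashlib

    class Bloom:
        def __init__(self, m=1024, k=3):
            self.m, self.k, self.bits = m, k, 0
        def _hashes(self, value):
            for i in range(self.k):
                h = hashlib.sha1(f"{i}:{value}".encode()).hexdigest()
                yield int(h, 16) % self.m
        def add(self, value):
            for h in self._hashes(value):
                self.bits |= 1 << h
        def may_contain(self, value):
            return all((self.bits >> h) & 1 for h in self._hashes(value))

    class DataFile:
        """An immutable file of (primary key -> record) pairs plus an embedded
        Bloom filter over one secondary attribute."""
        def __init__(self, records, attr):
            self.records, self.attr = records, attr
            self.bloom = Bloom()
            for rec in records.values():
                self.bloom.add(rec[attr])
        def scan(self, value):
            return [(k, r) for k, r in self.records.items() if r[self.attr] == value]

    files = [DataFile({1: {"city": "Riverside"}, 2: {"city": "Irvine"}}, "city"),
             DataFile({3: {"city": "Irvine"}}, "city")]

    def lookup_by_city(value):
        hits = []
        for f in files:                    # newest-to-oldest in a real LSM store
            if f.bloom.may_contain(value): # skip files that cannot contain the value
                hits.extend(f.scan(value))
        return hits

    print(lookup_by_city("Irvine"))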


Furthermore, we worked on the following problems:

(5) Accelerating in-memory database operations on FPGAs. In traditional CPU architectures, large cache hierarchies have been introduced to cope with long RAM access latencies. This approach proves very effective when data accesses exhibit locality (temporal and/or spatial). However, many important hash-based database algorithms (such as join and group-by aggregation) suffer from irregular data access. We suggested that FPGAs instead allow an alternative approach that hides the lack of locality through hardware multithreading. We prototyped FPGA-accelerated versions of hash-based join and aggregation algorithms and compared their performance with the best cache-optimized software counterparts. Our comparison showed that the FPGA multithreading approach achieves significant speedups even on the current generation of FPGAs, despite their much lower clock rates relative to a regular CPU. This work was published in CIDR'15.
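
To see where the irregular accesses come from, the short sketch below shows a plain hash join in Python: every probe is a data-dependent random lookup into the hash table, which caches handle poorly but which many outstanding hardware threads can hide. This is only an illustration of the access pattern, not the FPGA design; all names are illustrative.

    def hash_join(build_rows, probe_rows, build_key, probe_key):
        table = {}
        for row in build_rows:                           # build phase
            table.setdefault(row[build_key], []).append(row)
        out = []
        for row in probe_rows:                           # probe phase: each lookup is a
            for match in table.get(row[probe_key], []):  # data-dependent random access
                out.append({**match, **row})
        return out

    R = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    S = [{"rid": 2, "val": 10}, {"rid": 2, "val": 20}, {"rid": 9, "val": 30}]
    print(hash_join(R, S, "id", "rid"))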

(6) A study of the demographics of online health social outlets. The rapid spread of online social media has impacted how patients share health-related information, yet little work has studied the demographics of these users. We study the demographics of users who participate in health-related online social outlets to identify possible links to healthcare disparities. We analyze and compare three different types of health-related social outlets: (i) the general online social networks Twitter and Google+; (ii) drug review websites; and (iii) health web forums. We focus on the following demographic attributes: age, gender, ethnicity, location, and writing level. We build and evaluate domain-specific classifiers to infer missing data where possible. The estimated demographic statistics are compared against various baselines, such as Internet and social network usage of the general population. We identified interesting and actionable disparities in the participation of various demographic groups in the various types of health-related social outlets; these disparities are significantly distinct from the disparities in Internet usage or participation in general social outlets. This paper was accepted for publication in the Journal of Medical Internet Research.

Research Activities – Year 2

During the second year of this collaborative project, the UCR team worked on the following:

·        Conducted research and development related to BAD Asterix, the data serving component of the BAD system.

·        Collaborated with UCI colleagues on design discussions related to the BAD data broker network, the information dissemination component of the BAD system.

·        Designed and worked on implementing two sample BAD applications and deploying BAD and the applications on a 6-node cluster.

In particular:

(1) We have implemented a working version of repetitive channels (Aim 1) and are currently experimenting with their scalability. Our overall vision for the BAD system appears in [6], where we describe a scalable system that can continuously and reliably capture Big Data in order to enable timely and automatic delivery of new information to a large pool of interested users, while also supporting analyses of historical information. This first paper zooms in on the Data Serving piece of the BAD puzzle, including its key concepts and user model.

(2) We considered the problem of applying diversification (Aim 2) to the posts arriving from a collection of microblog streams [7]. Today, users conveniently consume content by subscribing to content generators such as Twitter accounts or news agencies. However, given the number of subscriptions and the rate of the subscription streams, users suffer from information overload. To address this issue, we proposed a novel and flexible diversification paradigm to prune redundant posts from a collection of streams. A key novelty of our diversification model is that it holistically incorporates three important dimensions of social posts, namely content, time, and author. Different applications, such as microblogging, news, or bibliographic services, require different settings for these three dimensions. Further, each dimension poses unique performance challenges in scaling the diversification model to many users and many high-throughput streams. In [7] we showed that hash-based content distance measures and graph-based author distance measures are both effective and efficient for social posts. Further, we proposed scalable real-time stream processing algorithms, leveraging efficient indexes, that take a social post stream as input and output a version of the stream diversified across all three dimensions. These techniques can be extended to serve multiple users by reusing indexing and computation where possible.
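
As a flavor of what a hash-based content-distance check can look like, the sketch below prunes near-duplicate posts from a stream using MinHash signatures over word shingles. This is an illustration of the general technique, assuming MinHash as the hash-based measure; it is not necessarily the exact measure or algorithm used in [7], and it covers only the content dimension (not time or author). All names and thresholds are illustrative.

    import hashlib

    def shingles(text, n=3):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

    def minhash(sh, num_hashes=64):
        return [min(int(hashlib.sha1(f"{i}:{s}".encode()).hexdigest(), 16) for s in sh)
                for i in range(num_hashes)]

    def est_similarity(sig1, sig2):
        # Fraction of matching minima estimates the Jaccard similarity of the shingle sets.
        return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

    def diversify(stream, threshold=0.6):
        """Emit a post only if it is not too similar to any previously emitted post."""
        kept = []
        for post in stream:
            sig = minhash(shingles(post))
            if all(est_similarity(sig, s) < threshold for _, s in kept):
                kept.append((post, sig))
                yield post

    posts = ["big storm hits the coast tonight",
             "big storm hits the coast tonight says weather service",
             "local team wins the championship game"]
    print(list(diversify(posts)))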

Moreover, we also worked on the following problems:

In [8] we examined indexing in high-dimensional domains and presented DiVA, a multi-dimensional indexing approach that combines the selective use of vector approximation with an indexing mechanism that organizes data subspaces in a high fan-out hierarchical structure. Moreover, DiVA reorganizes its own elements after receiving application hints regarding data access patterns. These hints, or policies, trigger the restructuring and possible expansion of DiVA so as to offer finer indexing granularity and improved access times in subspaces emerging as ‘hot spots’. The novelty of our approach lies in the self-organizing nature of DiVA, driven by application-provided policies; the latter effectively guide the refinement of DiVA's elements as new data arrive, existing data are updated, and the nature of the query workload continually changes. An extensive experimental evaluation using real data shows that DiVA reduces the total number of I/Os by up to 64% compared with state-of-the-art methods, including the VA-file, the GC-tree, and the A-tree.
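
For readers unfamiliar with vector approximation, the sketch below shows the VA-file-style filter-and-refine idea that DiVA builds on (it is not DiVA itself, which adds the hierarchical structure and policy-driven reorganization): each vector is quantized to a few bits per dimension, the coarse grid cells are used to rule out most vectors, and only the survivors are checked exactly. Names and parameters are illustrative.

    import numpy as np

    def quantize(data, bits=2):
        """Map each vector to a coarse grid cell (a few bits per dimension)."""
        lo, hi = data.min(axis=0), data.max(axis=0)
        cells = ((data - lo) / (hi - lo + 1e-9) * (2 ** bits)).astype(int)
        return np.clip(cells, 0, 2 ** bits - 1), (lo, hi)

    def range_query(data, approx, bounds, center, radius, bits=2):
        lo, hi = bounds
        width = (hi - lo) / (2 ** bits)              # cell width per dimension
        cell_lo = lo + approx * width                # lower corner of each vector's cell
        cell_hi = cell_lo + width
        # Filter: minimum possible distance from `center` to each cell (0 if inside).
        gap = np.maximum(np.maximum(cell_lo - center, center - cell_hi), 0)
        candidates = np.where(np.linalg.norm(gap, axis=1) <= radius)[0]
        # Refine: exact distance check only on the surviving candidates.
        return [int(i) for i in candidates
                if np.linalg.norm(data[i] - center) <= radius]

    data = np.random.rand(1000, 8)
    approx, bounds = quantize(data)
    print(range_query(data, approx, bounds, center=np.full(8, 0.5), radius=0.4))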

In [9] we demonstrated a framework we have built to perform spatiotemporal investigative search over social media posts. Most research on analyzing social data has focused on detecting trends and patterns, such as bursty topics and popular spatiotemporal paths, event extraction, studying how information spreads, or analyzing properties of the social graph. In contrast, and given the spatiotemporal nature of many social posts, we are interested in how to search social network data items, posts and images, based on spatiotemporal keyword queries. In contrast to the trending queries studied by previous work, investigative search can be viewed as exploring the currently untapped long tail of the distribution of topics in social networks. This work led to the creation of OSNI (Online Social Network Investigator), a scalable distributed system for searching social network data based on a spatiotemporal window and a list of keywords.
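
The query form OSNI answers can be pictured with the naive in-memory filter below: a bounding box, a time window, and a keyword list select the matching posts. The real system is distributed and index-backed; this sketch only illustrates the query semantics, and all names and the sample data are made up.

    from datetime import datetime

    posts = [
        {"text": "road closed after flooding downtown",
         "lat": 33.98, "lon": -117.37, "time": datetime(2016, 3, 1, 14, 0)},
        {"text": "great concert tonight",
         "lat": 34.05, "lon": -118.24, "time": datetime(2016, 3, 1, 20, 0)},
    ]

    def investigate(posts, keywords, bbox, t_start, t_end):
        """Return posts inside the bounding box and time window containing all keywords."""
        lat_min, lat_max, lon_min, lon_max = bbox
        out = []
        for p in posts:
            if (lat_min <= p["lat"] <= lat_max and lon_min <= p["lon"] <= lon_max
                    and t_start <= p["time"] <= t_end
                    and all(k in p["text"].lower() for k in keywords)):
                out.append(p)
        return out

    print(investigate(posts, ["flooding"], (33.9, 34.0, -117.5, -117.3),
                      datetime(2016, 3, 1), datetime(2016, 3, 2)))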

Our work in [10] presents a study that analyzed the content of Web-based health-related social media based on users' demographics, so as to identify which health topics are discussed in which social media by which demographic groups and to help guide educational and research activities. In particular, we analyzed three different types of health-related social media: (1) the general Web-based social networks Twitter and Google+; (2) drug review websites; and (3) health Web forums, with a total of about 6 million users and 20 million posts. We analyzed the content of these posts based on the demographic group of their authors, in terms of sentiment and emotion, top distinctive terms, and top medical concepts. The results of this study identified the dominant topics for various categories of users of these sites, based on sex, age, geographical location, and education.

Finally, in [11] we looked at health provider portals and performed a correlation analysis to see whether traditional quality indicators (medical school, graduation year, procedures, fellowships, patient reviews, location, and technology usage) are associated with a provider's quality, as indicated by the provider's referral frequency and by a peer-nominated quality designation. Our data-driven analysis identified several attributes that correlate with, and are discriminative of, referral volume and peer-nominated awards. In particular, our results demonstrated that these attributes vary by locality and that the frequency of an attribute is more important than its value (e.g., the number of patient reviews or hospital affiliations is more important than the average review rating or the ranking of the hospital affiliations, respectively). We demonstrated that it is possible to build accurate classifiers for referral frequency and quality designation, with accuracies over 85%.

Publications:

[1] Mohiuddin Abdul Qader, Shiwen Cheng, Abhinand Menon, Vagelis Hristidis: “Efficient Secondary Attribute Lookup in NoSQL Databases”; submitted for publication.

[2] Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras: “FPGA-based Multithreading for In-Memory Hash Joins”, Seventh Biennial Conference on Innovative Data Systems Research (CIDR), 2015, link

[3] Roger Moussalli, Ildar Absalyamov, Marcos R. Vieira, Walid A. Najjar, Vassilis J. Tsotras: “High performance FPGA and GPU complex pattern matching over spatio-temporal streams”, GeoInformatica, Vol. 19, No. 2, 2015, link

[4] Shouq A Sadah, Moloud Shahbazi, Matthew T Wiley, Vagelis Hristidis: “A Study of the Demographics of Online Health Social Outlets”, Journal of Medical Internet Research, 2015, link

[5] V. Borkar, Y. Bu, E. Carman, N. Onose, T. Westmann, P. Pirzadeh, M. Carey, and V.J. Tsotras, “Algebricks: A Data Model-Agnostic Compiler Backend for Big Data Languages”, ACM Symp. on Cloud Computing (SoCC), pp. 422-433, 2015, link

[6] Michael J. Carey, Steven Jacobs, and Vassilis J. Tsotras, “Breaking BAD: a Data Serving Vision for Big Active Data”, Proc. of the 10th ACM Int’l Conf. on Distributed and Event-Based Systems (DEBS '16), Irvine, CA, June 2016, link

[7] Shiwen Cheng, Marek Chrobak, and Vagelis Hristidis. “Slowing the Firehose: Multi-Dimensional Diversity on Social Post Streams”, Proc. of the Int’l. Conf. on Extending Database Technology (EDBT ‘16), Bordeaux, France, March 2016, link

[8] Konstantinos Tsakalozos, Spiros Evangelatos, Fotis Psallidas, Marcos R. Vieira, Vassilis J. Tsotras, and Alex Delis, “DiVA: Using Application-Specific Policies to ‘Dive’ into Vector Approximations”, The Computer Journal, Vol. 59, No. 9, 2016, pp. 1363-1382, link

[9] Shiwen Cheng, James Fang, Vagelis Hristidis, Harsha V. Madhyastha, Niluthpol Chowdhury Mithun, Dorian Perkins, Amit K. Roy-Chowdhury, Moloud Shahbazi, Vassilis J. Tsotras, “OSNI: Searching for Needles in a Haystack of Social Network Data”, Proc. of the Int’l. Conf. on Extending Database Technology (EDBT ‘16), Bordeaux, France, March 2016, pp. 616-619, link

[10] Shouq A. Sadah, Moloud Shahbazi, Matthew T. Wiley, and Vagelis Hristidis, “Demographic-Based Content Analysis of Online Health-Related Social Media”, Journal of Medical Internet Research, Vol. 18, No. 6, June 2016, link

[11] Matthew T. Wiley, Ryan L. Rivas, and Vagelis Hristidis, “Provider Attributes Correlation Analysis to Their Referral Frequency and Awards”, BMC Health Services Research, Vol. 16, No. 1, March 2016, link