Funded by the National Science Foundation
PI: Vassilis J. Tsotras
Award Number: IIS-0910859
Duration: 08/15/2009 through 07/31/2012
This is a collaborative project with:
IIS-0910989. PI: Mike Carey, co-PI: Chen Li, University of California, Irvine
and
IIS-0705589. PI: Alin Deutsch, co-PI: Yannis Papakonstantinou, University of California, San Diego
Web Page: http://www.cs.ucr.edu/~tsotras/asterix/index.html
Students:
Mariam Salloum
Md Mahbub Hasan
Project Summary:
Over the past 10-15 years, the evolution of the human side of the Web (powered by HTML and HTTP) has revolutionized the way that most of us find things, buy things, and interact with our friends and colleagues, both within and across organizations. Behind the scenes, semistructured data formats and Web services are having a similar impact on the machine side of the Web. In semistructured data formats, of which XML is the de facto standard, information normally contained in a database schema or type definition is contained within the data itself, making the data self-describing. XML is enriching the information on the Web and our ability to find it and interchange it meaningfully, as are RDF and JSON. Many industry verticals have created XML-based standards to support inter-organization data exchange and processes, and XML-based backbones such as enterprise service buses (ESBs) have gained significant adoption in industry in support of Service-Oriented Architecture (SOA) initiatives. XML is also increasingly being used for document markup, which was its original purpose, and the Web-service-driven Software as a Service (SaaS) trend is changing the way that many organizations will access and use large software applications in the future. As a result, current indications are that the IT world will soon be awash in a sea of semistructured data, much of it XML data, and that semistructured data and services will likely play an increasingly prominent role in the IT landscape for many years to come.
In anticipation of the semistructured information explosion, this proposal targets the problems of ingesting, storing, indexing, processing, managing, and monitoring vast quantities of semistructured data, with the emphasis on vastness, i.e., scale. The project involves challenges related to parallel databases, semistructured data management, and data-intensive computing. To that end, the proposal brings together a team of five researchers, drawn from three UC campuses, with expertise spanning structured, semistructured, and unstructured data.
Research Activities – Year 1:
During the first (still underway) year of the ASTERIX effort, we have concentrated on the following research activities:
1. Design of the triggering system for ASTERIX: We have completed a thorough review of related and previous work on triggers, from the relational world and, more recently, the XML world. We are collaborating with our UC Irvine colleagues so that our design supports and is compatible with the Asterix Query Language (AQL). Our current approach is to create a higher-level, user-friendly, event-based system in which the user can use events to monitor complex predicates of interest. These events will in turn be translated internally into basic triggers by the system.
2. Filtering XML at very high speeds: As part of our task to design the ASTERIX pub/sub system, we have been examining parallelism during filtering. We have therefore considered filtering XML streams using novel hardware architectures, in particular FPGAs. FPGAs provide very high throughput for sequential tasks by exploiting on-chip parallelism, and filtering an XML stream falls in this category. This work has led to two papers. The first appeared in [1] and addressed the problem of supporting simple XPath profiles in such a filtering environment. The work in [2] (under submission) solves the case where the user profiles are complex twigs. We are currently working on implementing our algorithms on other parallel architectures, such as GPUs.
3. Query processing of semistructured data: We have worked on a unified approach to three basic problems in structural query processing, namely XML filtering, XML stream processing (tuple extraction), and XML query processing. Previous approaches were shown to be efficient for one or two of these problems, but were either inefficient or unsuitable for the remaining ones. We instead propose a unified approach that yields efficient algorithms for all three problems. We represent the queries and the XML documents using a sequential encoding, referred to as Node Encoded Tree Sequences (NETS), and we provide algorithms that address all three problems efficiently over this encoding. This work has led to one paper under submission [3].
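To make the filtering task in item 2 concrete, the following is a minimal software sketch of stack-based matching of simple path profiles against an XML stream. It is an illustration only and not the FPGA design of [1] or [2]: the names `parse_profile`, `PathFilter`, and `filter_document` are hypothetical, and the sketch assumes profiles that use only the child axis (`/`), the descendant axis (`//`), and the `*` wildcard, with no predicates or twigs.

```python
import re
import xml.sax


def parse_profile(path):
    """Split a path such as '/a//b/*' into a list of (axis, name) steps."""
    steps = [("desc" if sep == "//" else "child", name)
             for sep, name in re.findall(r"(//|/)([^/]+)", path)]
    if not steps:
        raise ValueError(f"unsupported profile: {path!r}")
    return steps


class PathFilter(xml.sax.ContentHandler):
    """Matches every profile in a single pass over the SAX event stream."""

    def __init__(self, profiles):
        super().__init__()
        self.profiles = {p: parse_profile(p) for p in profiles}
        # One stack per profile. Each entry is the set of "number of steps
        # already matched" values that are possible at that element depth.
        self.stacks = {p: [{0}] for p in profiles}
        self.matched = set()

    def startElement(self, name, attrs):
        for path, steps in self.profiles.items():
            stack = self.stacks[path]
            next_states = set()
            for state in stack[-1]:
                if state == len(steps):
                    continue  # profile already complete on this branch
                axis, label = steps[state]
                if label in ("*", name):
                    next_states.add(state + 1)  # this element satisfies the step
                if axis == "desc":
                    next_states.add(state)  # '//' may also skip this element
            if len(steps) in next_states:
                self.matched.add(path)
            stack.append(next_states)

    def endElement(self, name):
        # Leaving an element restores the states of its parent.
        for stack in self.stacks.values():
            stack.pop()


def filter_document(xml_text, profiles):
    """Return the subset of profiles matched by the document."""
    handler = PathFilter(profiles)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matched
```

For example, `filter_document("<a><c><b/></c><b><d/></b></a>", ["/a/b", "/a/d", "//d"])` reports the profiles `/a/b` and `//d` as matched. Each profile keeps its own stack here, so the per-profile work is independent; that independence is what makes this kind of filtering a natural candidate for parallel evaluation.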
Publications:
[1] Roger Moussalli, Mariam Salloum, Walid A. Najjar, Vassilis J. Tsotras, “Accelerating XML Query Matching through Custom Stack Generation on FPGAs”, High Performance Embedded Architectures and Compilers, 5th International Conference, HiPEAC 2010, Lecture Notes in Computer Science 5952, Springer, 2010, ISBN 978-3-642-11514-1.
[2] Roger Moussalli, Mariam Salloum, Walid Najjar, Vassilis Tsotras, “Massively Parallel XML Twig Filtering on FPGAs”, submitted for publication, 2010.
[3] Mariam Salloum and Vassilis J. Tsotras, “A Unified Approach for Structural XML Processing”, submitted for publication, 2010.