Funded by the National Science Foundation
PI:
Vassilis J. Tsotras
Award
Number: IIS-0705916
Duration: 09/01/2007 through 08/31/2010
This
is a collaborative project with:
IIS-0705345
PI: Carlo Zaniolo
University of California, Los Angeles
and
IIS-0705589
PI: Victor Vianu,
Co-PI: Alin B Deutsch
University of California, San Diego
Web
Page:
http://www.cs.ucr.edu/~tsotras/meta-manager
Students:
Mariam
Salloum
Adam Woss
Wanxin
(Sarah) Xu
Project Summary:
Database schema evolution is a constant in the life-cycle of Information Systems, and a source of major costs for maintenance, upgrading, and service down time. The traditional process of installing a new schema, converting the database, and rewriting applications is slow and laborious. Instead, this project develops the novel technology whereby the schema evolution problem is reduced to coordinating mappings between multiple concurrent versions of schema, applications, and database. The project's approach consists of developing: (i) XML-based architectures for unifying the management of evolving data and metadata, (ii) methods for capturing evolution via schema mappings, (iii) efficient mapping techniques for queries and applications. These advances will enable the development of the MetaManager, a system that supports: (a) preserving and querying database histories, and (b) better planning on how-to evolve current schema versions, via evaluation and testing of ``what-if'' scenarios. Its functionality and performance will be validated using various testbeds, including the SDSC Storage Request Broker, which hosts scientific data for research groups ranging from astrophysicists to biologists. This novel and timely approach provides a solution to both the evolution and preservation of information systems. Because of the key role played by IS, a broad range of scientific, educational, and economic activities will benefit. Project funds will support training of PhD students while undergraduate research interns will train on schema-evolution scenarios using the MetaManager. The technology will become part of DB-design courses.
Research Activities – Year 1:
During the first year of this project we concentrated in implementing a prototype for supporting multiversion XML queries. Since the Meta Manager is an XML based system, we need efficient ways to manage multiple versions of XML data as well as address complex queries on such data. In particular we have assumed that an XML document evolves through linear versions (v1, v2, etc.) and examined how typical 'twig' structural queries can be examined over versions. Note that a typical twig query asks where in a document the structure of the twig appears. This structure looks like a tree (twig) and nodes are connected through parent/child and ancestor/descendant edges.
In a multiversion XML document, the user can specify a twig and a version (or version interval) and our system has to find where the query twig appeared in the version(s) of interest. Of course this query is trivial if someone decided to store snapshots of all versions. This however would take too much space. Instead we store 'deltas' between versions. We thus assign to a given node on the XML document a version interval (i.e., the versions for which this node is current). We would like an algorithm that can answer multiversion twig queries in the same time complexity as for an non-versioned XML document.
We explored the various XML processing techniques that have been proposed in literature for non-versioned environments and we divided them in three categories: (i) methods based on structural joins (Twigstack approach), (ii) methods based on subsequence matching (ViST, PriX etc.) and (iii) methods based on structural summaries. We have first considered the Twigstack approach. This approach divides document 'tags' into lists and answers the twig queries through joins among the query lists. While the original approach assumes an interval based numbering scheme, this is not appropriate for a versioning environment since it cannot easily support documents that change over versions (a new numbering of the whole document is needed, which is not practical). Instead we propose to use a dynamic numbering scheme (like ORDPATH). We have thus implemented TwigStack with the ORDPATH numbering scheme. In order to add versions, we have implemented a multiversion B-tree on top of each tag list. To answer a twig query on a specific version, the multiversion B-tree is first accessed and it thus provides access to this list as it was in the required version. We have finished the Twigstack implementation with versions and we are reporting our findings in [1].
It should be noted, that the above approach matches well with the environment presented in PRISM and PRIMA by our UCLA collaborators since the attribute columns used there can be easily supported by the multiversion lists we propose. Moreover, during the same period we have explored the use of structural summaries in XML query processing (this is one of the approaches that we will consider for multiversion support). These findings have been included in a tutorial [2]. We have also worked on two related papers [3,4]. In [3] we examine XML publish/subscribe systems and how their routing can be enhanced by the use of structural summaries. In [4] we concentrate on the XML filtering problem and provide efficient ways to prune user profiles thus leading to very fast query processing. Finally, in the same period we were also involved in various encyclopedia papers related to temporal query processing [5-7].
Research Activities – Year 2:
During the second year of this project we continued with the problem of supporting multiversion XML queries. First, we completed the work on linear versions. Here in addition to the Twigstack approach examined above, we implemented and tested two variations of the LCS-Trim approach. LCS-Trim transforms a twig query and the document into sequences and then performs the query using subsequence matching. In our experiments, the LCS-Trim approach when extended to use Multiversion B-trees performed faster in the majority of the scenarios than the Twigstack-based approach. These results were reported in [8].
Next, we concentrated on the case of XML documents with branched versions. Branched versioning is more complex as it allows one version to emanate from any previous version of the XML document (and not only the last version, as in linear versioning). In order to maintain compatibility with our previous implementations, we used the same ORDPATH numbering scheme for numbering the evolving XML documents. We examined how the state-of-the-art XML query processing techniques (namely Twigstack and LCS-Trim) can be adapted to the branched versioning framework. For both algorithms, one needs to provide a requested version of the XML document efficiently. By efficient we mean a method that does not create a separate copy for each version of the XML document. We thus presented a generic way to organize XML multiversion data using Branched-Temporal lists (BT-lists). These results were reported in [9] and will be submitted for publication.
We also continued our work on XML query processing techniques. In particular, we proposed a new sequence-based approach for efficient filtering of XML documents in a pub-sub system. Our approach transforms both XML documents and XPath twig expressions into Node Encoded Tree Sequences (NETS). In terms of this encoding, we provided a necessary and sufficient condition for a query to represent a match in a given XML tree. The proposed filtering procedure is based on a new subsequence matching algorithm devised for NETS, which identifies a set of matched queries free of false dismissals and false positives with a single scan of the XML document. Experimental results have shown that the proposed approach outperforms previous XML filtering approaches (like YFilter and FiST). The results were reported in [10].
We also examined new architecture paradigms for filtering XML documents. While there has been recently a lot of work proposing the use of multicores for achieving high throughput through parallelism, we took a different approach. In particular, we suggested that reconfigurable computing (FPGAs) can achieve extremely high throughputs for specific streaming applications. One such example is filtering of XML documents in pub-sub systems. In particular, we showed how simple path queries (expressed in XPath) can be implemented on FPGA chips. The XPath queries (subscriptions) are translated to regular expressions which are then mapped to FPGA devices. By introducing stacks within the FPGA we are able to express and process a wide range of path queries very efficiently, on a scalable environment. Moreover, the fact that the parser and the filter processing are performed on the same FPGA chip, eliminates expensive communication costs (that a multi-core system would need) thus enabling very fast and efficient pipelining. Our experimental evaluation revealed more than one order of magnitude improvement compared to traditional pub/sub systems. These results were reported in [11].
We are currently extending our work to include more complex XPath queries, allowing the use of recursion and wildcards in the query profile. We have devised a new approach that is more efficient and modular. Our results appear in [12].
Moreover, during this period we examined another problem related to versioning. In particular, we devised a flexible versioning scheme for network models. These models are frequently used as a mechanism to describe the connectivity information between spatial features in many emerging GIS applications. Our solution is based on the notion of dirty areas and dirty objects (i.e., regions or elements which contain conflicts between multiple versions) which are first identified and marked and subsequently cleaned. We have implemented a prototype of this versioning scheme and have used it extensively to support various applications which utilize network models. These findings appeared in [13]. A related demo was presented in [14].
Finally, during the same period, we co-edited an issue on 'XQuery Processing: Practice and Experience' for the Data Engineering Bulletin [15].
Publications:
[1] A. Woss, 'Twig Queries over Multiversion XML Documents', M.Sc. thesis, UC Riverside, 2008. [pdf]
[2] M. M. Moro, Z. Vagena, V.J. Tsotras, 'XML Structural Summaries', tutorial at VLDB Conference, 2008. [pdf]
[3] M. M. Moro, Z. Vagena, V.J. Tsotras, ''Recent Advances and Challenges in XML Document Routing.,'' February, Open and Novel Issues in XML Database Applications: Future Directions and Advanced Technologies, IGI, E. Pardede (Ed.), 2008. [pdf]
[4] M.M. Moro, P. Bakalov, M. Salloum, V.J. Tsotras, 'Fast XML Filtering Using Efficient Profile Pruning Techniques', submitted for publication.
[5] M.M. Moro, V.J. Tsotras, 'Valid-time Indexing', Encyclopedia of Database Systems, T. Ozsu and L. Liu (eds.) 2008. [pdf]
[6] M.M. Moro, V.J. Tsotras, 'Transaction-time Indexing' , Encyclopedia of Database Systems, T. Ozsu and L. Liu (eds.) 2008. [pdf]
[7] M.M. Moro, V.J. Tsotras 'Bi-temporal Indexing' Encyclopedia of Database Systems, T. Ozsu and L. Liu (eds.) 2008. [pdf]
[8] A. Woss, V.J. Tsotras, 'Experimental Evaluation of Query Processing Techniques over Multiversion XML Documents', Proceedings of 12th International Workshop on the Web and Databases (WebDB 2009). Providence, Rhode Island - June 28, 2009. [pdf]
[9] W. Xu, 'Structured Queries Over XML Documents with Branched Versioning', MSc. Thesis, UC Riverside, 2009. [pdf]
[10] M. Salloum, V.J. Tsotras, 'Efficient and Scalable Sequence-Based XML Filtering System', Proceedings of 12th International Workshop on the Web and Databases (WebDB 2009). Providence, Rhode Island - June 28, 2009. [pdf]
[11] Abhishek Mitra, Marcos R. Vieira, Petko Bakalov, Vassilis J. Tsotras, Walid A. Najjar: Boosting XML filtering through a scalable FPGA-based architecture. CIDR Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009. [pdf]
[12] R. Moussali, M. Salloum, W. Najjar, V.J. Tsotras. 'Custom stack generation for accelerated XML query matching on FPGAs', submitted for publication.
[13] Petko Bakalov, Erik Hoel, Sudhakar Menon and Vassilis J. Tsotras, 'Versioning of Network Models in a Multiuser Environment', Proceedings of 11th International Symposium on Spatial and Temporal Databases (SSTD), July 8-10, 2009, Aalborg, Denmark. [pdf]
[14] Petko Bakalov, Erik G. Hoel, Wee-Liang Heng, Vassilis J. Tsotras, 'Editing and versioning dynamic network models', Demo presented at the 16th ACM International Symposium on Advances in Geographic Information Systems, ACM-GIS 2008, November 5-7, 2008, Irvine. [pdf]
[15] Michael J. Carey, Vassilis J. Tsotras: Letter from the Special Issue Editors, 'XQuery Processing: Practice and Experience', IEEE Data Eng. Bull. 31(4): 6 (2008).