"Collaborative Research: Graceful Evolution and Historical Queries in Information Systems-- a Unified Approach "

Funded by the National Science Foundation

PI: Vassilis J. Tsotras
Award Number: IIS-0705916
Duration: 09/01/2007 through 08/31/2011

This is a collaborative project with:
IIS-0705345
PI: Carlo Zaniolo
University of California, Los Angeles
and
IIS-0705589
PI: Victor Vianu, Co-PI: Alin B Deutsch
University of California, San Diego

Web Page:
http://www.cs.ucr.edu/~tsotras/meta-manager

Students:
Mariam Salloum
Adam Woss
Wanxin (Sarah) Xu
Wenyu Huo
Md. Mahbub Hasan

Project Summary:

Database schema evolution is a constant in the life-cycle of Information Systems, and a source of major costs for maintenance, upgrading, and service down time. The traditional process of installing a new schema, converting the database, and rewriting applications is slow and laborious. Instead, this project develops the novel technology whereby the schema evolution problem is reduced to coordinating mappings between multiple concurrent versions of schema, applications, and database. The project's approach consists of developing: (i) XML-based architectures for unifying the management of evolving data and metadata, (ii) methods for capturing evolution via schema mappings, (iii) efficient mapping techniques for queries and applications. These advances will enable the development of the MetaManager, a system that supports: (a) preserving and querying database histories, and (b) better planning on how-to evolve current schema versions, via evaluation and testing of ``what-if'' scenarios. Its functionality and performance will be validated using various testbeds, including the SDSC Storage Request Broker, which hosts scientific data for research groups ranging from astrophysicists to biologists. This novel and timely approach provides a solution to both the evolution and preservation of information systems. Because of the key role played by IS, a broad range of scientific, educational, and economic activities will benefit. Project funds will support training of PhD students while undergraduate research interns will train on schema-evolution scenarios using the MetaManager. The technology will become part of DB-design courses.

Research Activities – Year 1:

During the first year of this project we concentrated in implementing a prototype for supporting multiversion XML queries. Since the Meta Manager is an XML based system, we need efficient ways to manage multiple versions of XML data as well as address complex queries on such data. In particular we have assumed that an XML document evolves through linear versions (v1, v2, etc.) and examined how typical 'twig' structural queries can be examined over versions. Note that a typical twig query asks where in a document the structure of the twig appears. This structure looks like a tree (twig) and nodes are connected through parent/child and ancestor/descendant edges.

In a multiversion XML document, the user can specify a twig and a version (or version interval) and our system has to find where the query twig appeared in the version(s) of interest. Of course this query is trivial if someone decided to store snapshots of all versions. This however would take too much space. Instead we store 'deltas' between versions. We thus assign to a given node on the XML document a version interval (i.e., the versions for which this node is current). We would like an algorithm that can answer multiversion twig queries in the same time complexity as for an non-versioned XML document.

We explored the various XML processing techniques that have been proposed in literature for non-versioned environments and we divided them in three categories: (i) methods based on structural joins (Twigstack approach), (ii) methods based on subsequence matching (ViST, PriX etc.) and (iii) methods based on structural summaries. We have first considered the Twigstack approach. This approach divides document 'tags' into lists and answers the twig queries through joins among the query lists. While the original approach assumes an interval based numbering scheme, this is not appropriate for a versioning environment since it cannot easily support documents that change over versions (a new numbering of the whole document is needed, which is not practical). Instead we propose to use a dynamic numbering scheme (like ORDPATH). We have thus implemented TwigStack with the ORDPATH numbering scheme. In order to add versions, we have implemented a multiversion B-tree on top of each tag list. To answer a twig query on a specific version, the multiversion B-tree is first accessed and it thus provides access to this list as it was in the required version. We have finished the Twigstack implementation with versions and we are reporting our findings in [1].

It should be noted, that the above approach matches well with the environment presented in PRISM and PRIMA by our UCLA collaborators since the attribute columns used there can be easily supported by the multiversion lists we propose. Moreover, during the same period we have explored the use of structural summaries in XML query processing (this is one of the approaches that we will consider for multiversion support). These findings have been included in a tutorial [2]. We have also worked on two related problems. In [3] we examine XML publish/subscribe systems and how their routing can be enhanced by the use of structural summaries. We also started working on the XML filtering problem, looking for efficient ways to prune user profiles thus leading to very fast query processing. Finally, in the same period we were also involved in various encyclopedia papers related to temporal query processing [4-6].

Research Activities – Year 2:

During the second year of this project we continued with the problem of supporting multiversion XML queries. First, we completed the work on linear versions. Here in addition to the Twigstack approach examined above, we implemented and tested two variations of the LCS-Trim approach. LCS-Trim transforms a twig query and the document into sequences and then performs the query using subsequence matching. In our experiments, the LCS-Trim approach when extended to use Multiversion B-trees performed faster in the majority of the scenarios than the Twigstack-based approach. These results were reported in [7].

Next, we concentrated on the case of XML documents with branched versions. Branched versioning is more complex as it allows one version to emanate from any previous version of the XML document (and not only the last version, as in linear versioning). In order to maintain compatibility with our previous implementations, we used the same ORDPATH numbering scheme for numbering the evolving XML documents. We examined how the state-of-the-art XML query processing techniques (namely Twigstack and LCS-Trim) can be adapted to the branched versioning framework. For both algorithms, one needs to provide a requested version of the XML document efficiently. By efficient we mean a method that does not create a separate copy for each version of the XML document. We thus presented a generic way to organize XML multiversion data using Branched-Temporal lists (BT-lists). These results were reported in [8].

We also continued our work on XML query processing techniques. In particular, we proposed a new sequence-based approach for efficient filtering of XML documents in a pub-sub system. Our approach transforms both XML documents and XPath twig expressions into Node Encoded Tree Sequences (NETS). In terms of this encoding, we provided a necessary and sufficient condition for a query to represent a match in a given XML tree. The proposed filtering procedure is based on a new subsequence matching algorithm devised for NETS, which identifies a set of matched queries free of false dismissals and false positives with a single scan of the XML document. Experimental results have shown that the proposed approach outperforms previous XML filtering approaches (like YFilter and FiST). The results were reported in [9].

We also examined new architecture paradigms for filtering XML documents. While there has been recently a lot of work proposing the use of multicores for achieving high throughput through parallelism, we took a different approach. In particular, we suggested that reconfigurable computing (FPGAs) can achieve extremely high throughputs for specific streaming applications. One such example is filtering of XML documents in pub-sub systems. In particular, we showed how simple path queries (expressed in XPath) can be implemented on FPGA chips. The XPath queries (subscriptions) are translated to regular expressions which are then mapped to FPGA devices. By introducing stacks within the FPGA we are able to express and process a wide range of path queries very efficiently, on a scalable environment. Moreover, the fact that the parser and the filter processing are performed on the same FPGA chip, eliminates expensive communication costs (that a multi-core system would need) thus enabling very fast and efficient pipelining. Our experimental evaluation revealed more than one order of magnitude improvement compared to traditional pub/sub systems. These results were reported in [10].

We are currently extending our work to include more complex XPath queries, allowing the use of recursion and wildcards in the query profile. We have devised a new approach that is more efficient and modular. Our results appear in [11].

Moreover, during this period we examined another problem related to versioning. In particular, we devised a flexible versioning scheme for network models. These models are frequently used as a mechanism to describe the connectivity information between spatial features in many emerging GIS applications. Our solution is based on the notion of dirty areas and dirty objects (i.e., regions or elements which contain conflicts between multiple versions) which are first identified and marked and subsequently cleaned. We have implemented a prototype of this versioning scheme and have used it extensively to support various applications which utilize network models. These findings appeared in [12]. A related demo was presented in [13].

Finally, during the same period, the PI co-edited an issue on 'XQuery Processing: Practice and Experience' for the Data Engineering Bulletin [14].

Research Activities – Year 3:

In the first two years we concentrated on the indexing methods needed to support historical XML queries over data that is versioned using either linear and branched versions; nevertheless, the schema remained unchanged. Instead, in the third year of this project we focused on extending the historical XML queries when the database schema evolves. This is supplementary to the PRIMA work from our UCLA collaborators, where the schema is assumed to evolve linearly (i.e., as a sequence of versions). In our environment we consider the case where the schema and the data evolve using ''branched'' versioning (i.e., a tree of versions).

Many complex applications may have more complex schema evolution than linear. For example in revision control and software configuration management, brancing allows the duplication of an object under revision (such as a source code file, or a directory tree) so that modifications can happen in parallel along many branches. We have identified two different branching cases. In c-branching, a new branch is created from the most current schema, using the currently valid data under this schema. After the branch occurs, the two schemas can evolve in parallel (with different updates etc.) There is also h-branching, where branching occurs from a historical schema (and using the data valid at that time). For simplicity, we have concentrated first on c-branching (while the h-branching will be examined later).

A major difference when branching occurs is that a single time-stamp does not anymore uniquely identifies a version (as in the linear versioning case). Thus the branch-id is another attribute we need to add on time-stamps. When a schema c-branch is created, each data table on the parent schema can either be copied to the new schema, or, be shared. The copy approach effectively creates one linear evolution per branch, in which case one could use directly the PRIMA techniques (SMOs etc.) However, this creates significant space overhead from the copied data. Instead we examined the 'shared' option and provided policies for updating the data on each version they belong. As in PRIMA we use XML to present and maintain the schema changes (this is done within the BMV-document which is an extension of PRIMA's MV-document to branched versions). In [15] we present techniques and algorithms for querying the data in the presence of branched schema versions. This work will continue over the no-cost extension in collaboration with Prof. Zaniolo at UCLA.

We further continued our collaboration with ESRI on versioning and editing network models; this led to a journal submission [16].

We further worked on a unified approach to solve three basic XML processing problems, namely: XML filtering, XML stream processing (tuple extraction), and XML query processing. Previous approaches were shown to be efficient for one or two of these problems, but were either inefficient or not suitable for the third problem. In [17] we present a unified approach used to devise efficient algorithms for all three problems. We represent the queries and XML documents using our Node Encoded Tree Sequences (NETS) [9]. The query matching is composed of two procedures, subsequence matching and structural matching. The solution of the subsequence matching problem is based on the classical dynamic programming recurrence relation for the Longest Common Subsequence (LCS) problem. A new necessary and sufficient condition is presented which provides a simple verification procedure for structural matching. We present an efficient algorithm for XML filtering and tuple-extraction problems, where subsequence and structural matching are performed concurrently. For the query processing problem we implemented and evaluated three new algorithms. Experimental results show that our algorithms outperform state of the art approaches for the XML structural processing problems examined.

Recently we have also worked on the problem of providing diverse answers to queries. This is useful for queries that return large answer sets; the user may be interested in seeing quickly the most diverse among the answers. In [18] we describe a general framework for evaluation and optimization of methods for diversifying query results. In these methods, an initial ranking answer set produced by a query is used to construct a result set, where elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as possible to the query, and, at the same time, the result set should be as diverse as possible. Using the above framework, we adapted, implemented and evaluated several existing methods for diversifying query results and proposed two new approaches, namely the Greedy with Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods. Our experimental results show that while GNE and GMC have higher running times, they achieve precision very close to the optimal, while also providing the best result quality. More recently, we examined the diversity problem over XML queries. The difference is that XML data can also differ in complex structure (not only in values) thus efficient ways to define and compute diversity among XML documents are needed. In [19] we provide efficient algorithms to compute structural and value diversity among XML query results.

In [20] we present a study of top-k search in social tagging sites by utilizing multiple social networks and temporal information. In particular, besides the global connection, we consider two main social networks, namely the 'friendship' and the 'common interest' networks in our scoring functions. Based on the degree of participation in various networks, users can be categorized into specific classes that differ in their weights on each scoring component. Temporal information, usually ignored by previous works, can enhance the popularity and freshness of the ranking results. Experiments and evaluations on real social tagging datasets show that our framework works well in practice and give useful and intuitive results.

In [21] we examine the problem of identifying influential events from spatiotemporal streams of news data. Consider a set of sources of streaming documents, placed on fixed locations on a 2-dimensional map. These text streams can be thought of as daily news-sites or blogs from different countries that record significant events from their respective locations. Each event (or topic) is characterized by a set of terms that describe its different aspects (e.g. descriptive keywords or names of places and people involved). During the timeframe in which an event is covered in the streams, such terms are used repeatedly in the relevant documents that appear in the mainstream and social media. We show how we can index this knowledge to efficiently evaluate multi-term queries and retrieve matching documents that discuss influential events with a significant spatiotemporal impact. This work has been submitted to ICDE 2011.

Finally, during the same period, the PI co-edited an issue on 'Result Diversity' for the Data Engineering Bulletin [22].

Publications:

[1] A. Woss, 'Twig Queries over Multiversion XML Documents', M.Sc. thesis, UC Riverside, 2008. [pdf]

[2] M. M. Moro, Z. Vagena, V.J. Tsotras, 'XML Structural Summaries', tutorial at VLDB Conference, 2008. [pdf]

[3] M. M. Moro, Z. Vagena, V.J. Tsotras, ''Recent Advances and Challenges in XML Document Routing.,'' February, Open and Novel Issues in XML Database Applications: Future Directions and Advanced Technologies, IGI, E. Pardede (Ed.), 2008. [pdf]

[4] M.M. Moro, V.J. Tsotras, 'Valid-time Indexing', Encyclopedia of Database Systems, T. Ozsu and L. Liu (eds.) 2008. [pdf]

[5] M.M. Moro, V.J. Tsotras, 'Transaction-time Indexing' , Encyclopedia of Database Systems, T. Ozsu and L. Liu (eds.) 2008. [pdf]

[6] M.M. Moro, V.J. Tsotras 'Bi-temporal Indexing' Encyclopedia of Database Systems, T. Ozsu and L. Liu (eds.) 2008. [pdf]

[7] A. Woss, V.J. Tsotras, 'Experimental Evaluation of Query Processing Techniques over Multiversion XML Documents', Proceedings of 12th International Workshop on the Web and Databases (WebDB 2009). Providence, Rhode Island - June 28, 2009. [pdf]

[8] W. Xu, 'Structured Queries Over XML Documents with Branched Versioning', MSc. Thesis, UC Riverside, 2009. [pdf]

[9] M. Salloum, V.J. Tsotras, 'Efficient and Scalable Sequence-Based XML Filtering System', Proceedings of 12th International Workshop on the Web and Databases (WebDB 2009). Providence, Rhode Island - June 28, 2009. [pdf]

[10] Abhishek Mitra, Marcos R. Vieira, Petko Bakalov, Vassilis J. Tsotras, Walid A. Najjar: Boosting XML filtering through a scalable FPGA-based architecture. CIDR Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009. [pdf]

[11] R. Moussali, M. Salloum, W. Najjar, V.J. Tsotras. 'Custom stack generation for accelerated XML query matching on FPGAs', submitted for publication.

[12] Petko Bakalov, Erik Hoel, Sudhakar Menon and Vassilis J. Tsotras, 'Versioning of Network Models in a Multiuser Environment', Proceedings of 11th International Symposium on Spatial and Temporal Databases (SSTD), July 8-10, 2009, Aalborg, Denmark. [pdf]

[13] Petko Bakalov, Erik G. Hoel, Wee-Liang Heng, Vassilis J. Tsotras, 'Editing and versioning dynamic network models', Demo presented at the 16th ACM International Symposium on Advances in Geographic Information Systems, ACM-GIS 2008, November 5-7, 2008, Irvine. [pdf]

[14] Michael J. Carey, Vassilis J. Tsotras: Letter from the Special Issue Editors, 'XQuery Processing: Practice and Experience', IEEE Data Eng. Bull. 31(4): 6 (2008).

[15] Wenyu Huo, Vassilis J. Tsotras, "Efficient Management and Querying of Transaction-time Databases under Branched Schema Evolution", UCR TR, July 2010.

[16] Petko Bakalov, Erik Hoel, Sudhakar Menon, and Vassilis J. Tsotras, "Versioning and Editing of Network Models in a Multiuser Environment", submitted for publication.

[17] Mariam Salloum and Vassilis J. Tsotras, “A Unified Approach for Structural XML Processing”, submitted for publication.

[18] Marcos R. Vieira, Humberto L. Razente, Maria C. N. Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: “On Query Result Diversification,” submitted for publication.

[19] Md. Mahbub Hasan, Vassilis J. Tsotras: "On XML query result diversity," under submission.

[20] Wenyu Huo, Vassilis J. Tsotras: "Temporal Top-k Search in Social Tagging Sites Using Multiple Social Networks," link, In Proceedings of DASFAA, 2010, pp: 498-504.

[21] Theodoros Lappas, Marcos Vieira, Dimitris Gunopulos, Vassilis J. Tsotras: “Searching for Documents on Influential Events through Space and Time,” submitted for publication.

[22] Marios Hadjieleftheriou, Vassilis J. Tsotras: Letter from the Special Issue Editors, 'Result Diversity', IEEE Data Eng. Bull. 32(4): 6 (2009).