Research: Graceful Evolution and Historical Queries in Information
Systems -- a Unified Approach
Funded by the National Science Foundation
PI: Vassilis J. Tsotras
University of California, Riverside
Duration: 09/01/2007 through 08/31/2011
This is a collaborative project with:
PI: Carlo Zaniolo
University of California, Los Angeles
PI: Victor Vianu
Co-PI: Alin B. Deutsch
University of California, San Diego
Md. Mahbub Hasan
Database schema evolution is a constant in the
life-cycle of Information Systems, and a source of major costs for
maintenance, upgrading, and service down time. The traditional
process of installing a new schema, converting the database, and
rewriting applications is slow and laborious. Instead, this project
develops novel technology whereby the schema evolution problem is
reduced to coordinating mappings between multiple concurrent versions
of schema, applications, and database. The project's approach
consists of developing: (i) XML-based architectures for unifying the
management of evolving data and metadata, (ii) methods for capturing
evolution via schema mappings, (iii) efficient mapping techniques for
queries and applications. These advances will enable the development
of the MetaManager, a system that supports: (a) preserving and
querying database histories, and (b) better planning of how to evolve
current schema versions, via evaluation and testing of ``what-if''
scenarios. Its functionality and performance will be validated using
various testbeds, including the SDSC Storage Request Broker, which
hosts scientific data for research groups ranging from
astrophysicists to biologists. This novel and timely approach
provides a solution to both the evolution and preservation of
information systems. Because of the key role played by IS, a broad
range of scientific, educational, and economic activities will
benefit. Project funds will support the training of PhD students, while
undergraduate research interns will train on schema-evolution
scenarios using the MetaManager. The technology will become part of
Research Activities – Year 1:
During the first year of this project we concentrated on
implementing a prototype for supporting multiversion XML queries.
Since the MetaManager is an XML-based system, we need efficient ways
to manage multiple versions of XML data as well as address complex
queries on such data. In particular, we have assumed that an XML
document evolves through linear versions (v1, v2, etc.) and
examined how typical 'twig' structural queries can be evaluated over
versions. Note that a typical twig query asks where in a document the
structure of the twig appears. This structure looks like a tree
(twig) whose nodes are connected through parent/child and
ancestor/descendant relationships.
In a multiversion XML document, the user can specify a
twig and a version (or version interval) and our system has to find
where the query twig appeared in the version(s) of interest. Of
course, this query is trivial if one stores snapshots of
all versions; this, however, would take too much space. Instead we
store 'deltas' between versions. We thus assign to each node of
the XML document a version interval (i.e., the versions for which
this node is current). We would like an algorithm that can answer
multiversion twig queries with the same time complexity as for a
non-versioned XML document.
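The interval-stamping scheme just described can be sketched as follows; this is an illustrative Python sketch under our own naming, not the project's implementation:

```python
# Each node of the multiversion XML document carries a version
# interval [start, end); the node is "current" in version v iff
# start <= v < end.  A snapshot of any version can then be
# reconstructed by filtering nodes on that interval.

class VNode:
    def __init__(self, tag, start, end=float("inf")):
        self.tag = tag
        self.start, self.end = start, end  # version interval

    def alive_in(self, v):
        return self.start <= v < self.end

def snapshot(nodes, v):
    """Tags of all nodes current in version v."""
    return [n.tag for n in nodes if n.alive_in(v)]

# 'price' was inserted in version 2 and deleted in version 4
nodes = [VNode("book", 1), VNode("title", 1), VNode("price", 2, 4)]
assert snapshot(nodes, 1) == ["book", "title"]
assert snapshot(nodes, 3) == ["book", "title", "price"]
assert snapshot(nodes, 4) == ["book", "title"]
```

Storing intervals instead of per-version snapshots keeps space proportional to the number of changes rather than to the number of versions.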
We explored the various XML processing techniques that
have been proposed in the literature for non-versioned environments and
we divided them into three categories: (i) methods based on structural
joins (the TwigStack approach), (ii) methods based on subsequence
matching (ViST, PRIX, etc.) and (iii) methods based on structural
summaries. We first considered the TwigStack approach. This
approach divides document 'tags' into lists and answers the twig
queries through joins among the query lists. While the original
approach assumes an interval based numbering scheme, this is not
appropriate for a versioning environment since it cannot easily
support documents that change over versions (a new numbering of the
whole document is needed, which is not practical). Instead we propose
to use a dynamic numbering scheme (like ORDPATH). We have thus
implemented TwigStack with the ORDPATH numbering scheme. In order to
add versions, we have implemented a multiversion B-tree on top of
each tag list. To answer a twig query on a specific version, the
multiversion B-tree is accessed first, thus providing access to
the list as it was in the required version. We have completed the
TwigStack implementation with versions and report our
findings in .
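To illustrate why a dynamic scheme such as ORDPATH suits an evolving document, the ancestor/descendant test can be sketched as a label-prefix check (a simplified sketch; real ORDPATH labels also use even 'careting' components so that new nodes can be inserted without relabeling):

```python
# ORDPATH-style labels are dotted component sequences; an ancestor's
# label is a proper prefix of its descendant's label, so structural
# joins need no global renumbering when the document changes.

def is_ancestor(anc, desc):
    a, d = anc.split("."), desc.split(".")
    return len(a) < len(d) and d[:len(a)] == a

assert is_ancestor("1.3", "1.3.5.7")      # 1.3 is on the path to 1.3.5.7
assert not is_ancestor("1.3", "1.5.7")    # different subtree
assert not is_ancestor("1.3.5", "1.3")    # a node is not its own ancestor
```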
It should be noted that the above approach matches well
with the environment presented in PRISM and PRIMA by our UCLA
collaborators since the attribute columns used there can be easily
supported by the multiversion lists we propose. Moreover, during the
same period we have explored the use of structural summaries in XML
query processing (this is one of the approaches that we will consider
for multiversion support). These findings have been included in a
tutorial . We have also worked on two related problems. In 
we examine XML publish/subscribe systems and how their routing can be
enhanced by the use of structural summaries. We also started working on
the XML filtering problem, looking for efficient ways to prune user
profiles, thus leading to very fast query processing. Finally, in the
same period we also contributed various encyclopedia entries
related to temporal query processing [4-6].
Research Activities – Year 2:
During the second year of this project we continued with
the problem of supporting multiversion XML queries. First, we
completed the work on linear versions. Here, in addition to the
TwigStack approach examined above, we implemented and tested two
variations of the LCS-Trim approach. LCS-Trim transforms a twig query
and the document into sequences and then performs the query using
subsequence matching. In our experiments, the LCS-Trim approach, when
extended to use multiversion B-trees, performed faster than the
TwigStack-based approach in the majority of the scenarios. These results
were reported in .
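The sequence-based idea behind LCS-Trim can be illustrated with a much-simplified toy: serialize both query and document into tag sequences and check for a subsequence match. (LCS-Trim itself operates on richer sequences and performs an additional structural verification; this sketch omits both.)

```python
# Toy sequence matching: a preorder tag serialization plus a
# subsequence test.  A subsequence match is necessary but not
# sufficient for a structural twig match, hence the verification
# step in the real algorithms.

def preorder(tree):
    tag, children = tree
    seq = [tag]
    for child in children:
        seq += preorder(child)
    return seq

def is_subsequence(query_seq, doc_seq):
    it = iter(doc_seq)
    return all(tag in it for tag in query_seq)  # 'in' advances the iterator

doc = ("book", [("title", []), ("author", [("name", [])])])
twig = ("book", [("name", [])])
assert is_subsequence(preorder(twig), preorder(doc))
assert not is_subsequence(["name", "book"], preorder(doc))
```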
Next, we concentrated on the case of XML documents with
branched versions. Branched versioning is more complex as it
allows one version to emanate from any previous version of the XML
document (and not only the last version, as in linear versioning). In
order to maintain compatibility with our previous implementations, we
used the same ORDPATH numbering scheme for numbering the evolving XML
documents. We examined how the state-of-the-art XML query processing
techniques (namely TwigStack and LCS-Trim) can be adapted to the
branched versioning framework. For both algorithms, one needs to
retrieve any requested version of the XML document efficiently. By
efficient we mean a method that does not create a separate copy for
each version of the XML document. We thus presented a generic way to
organize XML multiversion data using Branched-Temporal lists
(BT-lists). These results were reported in .
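The extra difficulty of branched versioning can be seen in a small sketch: versions form a tree rather than a line, and a datum is visible in version v only if it was created on the path from the root to v and not deleted on that path. (The bookkeeping below is our own, purely for illustration; BT-lists organize the same information for efficient access.)

```python
# Versions form a tree; visibility in version v depends on the
# root-to-v path, not on a linear order of versions.

parent = {"v1": None, "v2": "v1", "v3": "v1", "v4": "v2"}  # version tree

def path_to_root(v):
    while v is not None:
        yield v
        v = parent[v]

def visible(created_in, deleted_in, v):
    """Is an entry created in `created_in` (and deleted in the set
    of versions `deleted_in`) visible in version v?"""
    path = set(path_to_root(v))
    return created_in in path and not (deleted_in & path)

assert visible("v1", set(), "v4")        # inherited from the root
assert visible("v2", set(), "v4")        # created on v4's branch
assert not visible("v3", set(), "v4")    # v3 is a sibling branch
assert not visible("v1", {"v2"}, "v4")   # deleted along the path
```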
We also continued our work on XML query processing
techniques. In particular, we proposed a new sequence-based approach
for efficient filtering of XML documents in a pub-sub system. Our
approach transforms both XML documents and XPath twig expressions
into Node Encoded Tree Sequences (NETS). In terms of this encoding,
we provided a necessary and sufficient condition for a query to
have a match in a given XML tree. The proposed filtering
procedure is based on a new subsequence matching algorithm devised
for NETS, which identifies a set of matched queries free of false
dismissals and false positives with a single scan of the XML
document. Experimental results have shown that the proposed approach
outperforms previous XML filtering approaches (like YFilter and
FiST). The results were reported in .
We also examined new architecture paradigms for
filtering XML documents. While there has recently been a lot of work
proposing the use of multicore processors to achieve high throughput
through parallelism, we took a different approach. In particular, we
suggested that reconfigurable computing (FPGAs) can achieve extremely
high throughputs for specific streaming applications. One such
example is filtering of XML documents in pub-sub systems. In
particular, we showed how simple path queries (expressed in XPath)
can be implemented on FPGA chips. The XPath queries (subscriptions)
are translated to regular expressions which are then mapped to FPGA
devices. By introducing stacks within the FPGA we are able to express
and process a wide range of path queries very efficiently, in a
scalable environment. Moreover, the fact that the parser and the
filter processing are performed on the same FPGA chip eliminates
expensive communication costs (which a multi-core system would incur),
thus enabling very fast and efficient pipelining. Our experimental
evaluation revealed more than one order of magnitude improvement
compared to traditional pub/sub systems. These results were reported in .
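At a very high level, the XPath-to-regular-expression translation for simple path queries can be sketched as follows; the actual system compiles such expressions into FPGA state machines, and the helper below (with its event encoding) is entirely our own illustration:

```python
# Translate simple XPath path queries ('/' = child, '//' = descendant)
# into regexes over a stream of opening-tag events.  Closing tags (and
# hence the stacks added on the FPGA) are ignored in this toy version.

import re

def xpath_to_regex(xpath):
    steps = re.findall(r"(//|/)(\w+)", xpath)
    pattern = "^"
    for axis, tag in steps:
        if axis == "//":
            pattern += r"(?:<\w+>)*"   # any number of intervening tags
        pattern += f"<{tag}>"
    return pattern

stream = "<a><b><c>"                    # opening-tag events seen so far
assert re.match(xpath_to_regex("/a//c"), stream)
assert re.match(xpath_to_regex("/a/b/c"), stream)
assert not re.match(xpath_to_regex("/b//c"), stream)
```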
We are currently extending our work to include more
complex XPath queries, allowing the use of recursion and wildcards in
the query profile. We have devised a new approach that is more
efficient and modular. Our results appear in .
Moreover, during this period we examined another problem
related to versioning. In particular, we devised a flexible
versioning scheme for network models. These models are frequently
used as a mechanism to describe the connectivity information between
spatial features in many emerging GIS applications. Our solution is
based on the notion of dirty areas and dirty objects (i.e., regions
or elements that contain conflicts between multiple versions), which
are first identified and marked, and subsequently cleaned. We have
implemented a prototype of this versioning scheme and have used it
extensively to support various applications which utilize network
models. These findings appeared in . A related demo was presented in .
Finally, during the same period, the PI co-edited an issue
on 'XQuery Processing: Practice and Experience' for the Data
Engineering Bulletin .
Research Activities – Year 3:
In the first two years we concentrated on the indexing methods needed to
support historical XML queries over data that is versioned using either linear or
branched versions; in both cases, however, the schema remained unchanged. In the
third year of this project we focused on extending historical XML queries
to the case where the database schema itself evolves. This complements the PRIMA work from
our UCLA collaborators, where the schema is assumed to evolve linearly (i.e., as a
sequence of versions). In our environment we consider the case where the schema
and the data evolve using ``branched'' versioning (i.e., a tree of versions).
Many applications exhibit schema evolution that is more complex than linear.
For example, in revision control and software configuration management, branching allows
the duplication of an object under revision (such as a source code file or a directory
tree) so that modifications can happen in parallel along many branches. We have identified
two different branching cases. In c-branching, a new branch is created from the most current
schema, using the currently valid data under this schema. After the branch occurs, the two
schemas can evolve in parallel (with different updates etc.) There is also h-branching,
where branching occurs from a historical schema (and using the data valid at that time).
For simplicity, we have concentrated first on c-branching (while h-branching will be
examined in future work).
A major difference when branching occurs is that a single time-stamp no
longer uniquely identifies a version (as it does in the linear versioning case). Thus a
branch-id is another attribute we need to add to time-stamps. When a schema c-branch
is created, each data table on the parent schema can either be copied to the new schema,
or be shared. The copy approach effectively creates one linear evolution per branch,
in which case one could directly use the PRIMA techniques (SMOs etc.). However, this
creates significant space overhead from the copied data. Instead, we examined the 'shared'
option and provided policies for updating the data under each version to which they belong. As in PRIMA,
we use XML to represent and maintain the schema changes (this is done within the BMV-document,
which is an extension of PRIMA's MV-document to branched versions). In  we present
techniques and algorithms for querying the data in the presence of branched schema versions.
This work will continue over the no-cost extension in collaboration with Prof. Zaniolo at UCLA.
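The (branch-id, time-stamp) idea can be sketched as follows, with our own hypothetical bookkeeping: a query names a branch and a time, and a record stamped on some branch is reached by walking the branch tree upward, clipping the query time at each fork point.

```python
# Each branch records (parent_branch, fork_time); a record is stamped
# with its branch and a time interval [start, end).  Under the
# 'shared' option, records on an ancestor branch created before the
# fork remain visible to the child branch.

branch_parent = {"main": None, "b1": ("main", 5)}  # b1 forked at t=5

def alive_at(rec_branch, start, end, q_branch, q_time):
    b, t = q_branch, q_time
    while True:
        if b == rec_branch:
            return start <= t < end
        if branch_parent[b] is None:
            return False
        b, fork = branch_parent[b]
        t = min(t, fork)               # clip time at the fork point

assert alive_at("main", 2, 10, "main", 7)    # plain linear case
assert alive_at("main", 2, 10, "b1", 7)      # shared: existed at the fork
assert not alive_at("main", 6, 10, "b1", 7)  # created on main after the fork
assert not alive_at("b1", 6, 10, "main", 9)  # child data invisible to parent
```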
We further continued our collaboration with ESRI on versioning and editing network
models; this led to a journal submission .
We further worked on a unified approach to solve three basic XML processing problems,
namely: XML filtering, XML stream processing (tuple extraction), and XML query processing. Previous
approaches were shown to be efficient for one or two of these problems, but were either inefficient
or unsuitable for the third. In  we present a unified approach that yields
efficient algorithms for all three problems. We represent the queries and XML documents using
our Node Encoded Tree Sequences (NETS) . The query matching is composed of two procedures,
subsequence matching and structural matching. The solution of the subsequence matching problem
is based on the classical dynamic programming recurrence relation for the Longest Common Subsequence
(LCS) problem. A new necessary and sufficient condition is presented which provides a simple
verification procedure for structural matching. We present an efficient algorithm for the XML filtering
and tuple-extraction problems, where subsequence and structural matching are performed concurrently.
For the query processing problem we implemented and evaluated three new algorithms. Experimental
results show that our algorithms outperform state of the art approaches for the XML structural
processing problems examined.
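The classical LCS recurrence that the subsequence-matching step builds on is, in its textbook form (shown here for reference; the paper's algorithms refine it for NETS):

```python
# Textbook LCS dynamic program:
#   L[i][j] = L[i-1][j-1] + 1            if x[i-1] == y[j-1]
#           = max(L[i-1][j], L[i][j-1])  otherwise

def lcs_length(x, y):
    L = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[len(x)][len(y)]

assert lcs_length("ABCBDAB", "BDCABA") == 4   # e.g. "BCBA"
```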
Recently we have also worked on the problem of providing diverse answers to queries. This is
useful for queries that return large answer sets; the user may be interested in seeing quickly
the most diverse among the answers. In  we describe a general framework for evaluation and
optimization of methods for diversifying query results. In these methods, an initial ranked
answer set produced by a query is used to construct a result set, where elements are ranked
with respect to relevance and diversity features, i.e., the retrieved elements should be as
relevant as possible to the query, and, at the same time, the result set should be as diverse
as possible. Using the above framework, we adapted, implemented and evaluated several existing
methods for diversifying query results and proposed two new approaches, namely the Greedy with
Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods.
Our experimental results show that while GNE and GMC have higher running times, they achieve
precision very close to the optimal and also provide the best result quality. More recently,
we examined the diversity problem over XML queries. The difference is that XML data can also
differ in complex structure (not only in values); thus, efficient ways to define and compute
diversity among XML documents are needed. In  we provide efficient algorithms to compute
structural and value diversity among XML query results.
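The greedy-with-marginal-contribution idea can be sketched as follows (an MMR-style toy under our own scoring; GMC's actual contribution function and GNE's randomized neighborhood expansion are more involved):

```python
# Greedily pick the candidate with the best marginal contribution,
# trading relevance against dissimilarity to the items already chosen.
# The trade-off weight `lam` and the scoring details are illustrative.

def diversify(cands, rel, dist, k, lam=0.5):
    chosen = []
    pool = list(cands)
    while pool and len(chosen) < k:
        def marginal(c):
            div = min((dist(c, s) for s in chosen), default=1.0)
            return lam * rel[c] + (1 - lam) * div
        best = max(pool, key=marginal)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy 1-D example: items are points, relevance prefers small values,
# diversity prefers spread-out picks.
items = [0.0, 0.1, 0.2, 5.0, 5.1]
rel = {x: 1 - x / 10 for x in items}
res = diversify(items, rel, lambda a, b: abs(a - b), 3)
assert res[0] == 0.0            # most relevant item is picked first
assert any(x > 1 for x in res)  # but a far-away point gets picked too
```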
In  we present a study of top-k search in social tagging sites by utilizing multiple social
networks and temporal information. In particular, besides the global connection, we consider two
main social networks, namely the 'friendship' and the 'common interest' networks in our scoring
functions. Based on the degree of participation in various networks, users can be categorized
into specific classes that differ in their weights on each scoring component. Temporal information,
usually ignored by previous work, can enhance the popularity and freshness of the ranking results.
Experiments and evaluations on real social tagging datasets show that our framework works well in
practice and gives useful and intuitive results.
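As a toy illustration of the multi-network scoring idea, a tag's score for a user could blend per-network activity counts, weighted by the user's class, and decay with age for freshness; the weights, half-life, and decay form below are our own assumptions, not the paper's scoring functions:

```python
# Hypothetical blended score: weighted sum over networks, times an
# exponential freshness decay with a 30-day half-life.

def score(counts, weights, age_days, half_life=30.0):
    base = sum(weights[n] * counts[n] for n in weights)
    return base * 0.5 ** (age_days / half_life)

counts = {"global": 100, "friendship": 5, "common_interest": 8}
w = {"global": 0.2, "friendship": 0.5, "common_interest": 0.3}
assert score(counts, w, age_days=60) < score(counts, w, age_days=0)
```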
In  we examine the problem of identifying influential events from spatiotemporal streams
of news data. Consider a set of sources of streaming documents, placed on fixed locations on a
2-dimensional map. These text streams can be thought of as daily news-sites or blogs from different
countries that record significant events from their respective locations. Each event (or topic) is
characterized by a set of terms that describe its different aspects (e.g. descriptive keywords or
names of places and people involved). During the timeframe in which an event is covered in the
streams, such terms are used repeatedly in the relevant documents that appear in the mainstream
and social media. We show how we can index this knowledge to efficiently evaluate multi-term
queries and retrieve matching documents that discuss influential events with a significant
spatiotemporal impact. This work has been submitted to ICDE 2011.
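A minimal sketch of the underlying indexing idea (the actual index is considerably richer, supporting spatiotemporal ranking; the structure below is only our illustration): an inverted index from terms to (source, day) postings supports multi-term queries by postings intersection.

```python
# Inverted index: term -> set of (source, day) postings; a multi-term
# query intersects the postings to find where all terms co-occur.

from collections import defaultdict

index = defaultdict(set)

def add_doc(source, day, terms):
    for t in terms:
        index[t].add((source, day))

add_doc("blogA", 1, {"flood", "river"})
add_doc("blogA", 2, {"flood", "rescue"})
add_doc("blogB", 1, {"flood", "river"})

def query(terms):
    terms = iter(terms)
    hits = set(index[next(terms)])
    for t in terms:
        hits &= index[t]
    return hits

assert query({"flood", "river"}) == {("blogA", 1), ("blogB", 1)}
```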
Finally, during the same period, the PI co-edited an issue
on 'Result Diversity' for the Data
Engineering Bulletin .
 A. Woss, 'Twig Queries over Multiversion XML
Documents', M.Sc. thesis, UC Riverside, 2008. [pdf]
 M. M. Moro, Z. Vagena, V.J. Tsotras, 'XML Structural
Summaries', tutorial at VLDB Conference, 2008. [pdf]
 M. M. Moro, Z. Vagena, V.J. Tsotras, 'Recent
Advances and Challenges in XML Document Routing', in Open
and Novel Issues in XML Database Applications: Future Directions and
Advanced Technologies, E. Pardede (Ed.), IGI, 2008. [pdf]
 M.M. Moro, V.J. Tsotras, 'Valid-time Indexing', Encyclopedia
of Database Systems, T. Ozsu and L. Liu (eds.), 2008. [pdf]
 M.M. Moro, V.J. Tsotras, 'Transaction-time Indexing', Encyclopedia
of Database Systems, T. Ozsu and L. Liu (eds.), 2008. [pdf]
 M.M. Moro, V.J. Tsotras, 'Bi-temporal Indexing', Encyclopedia
of Database Systems, T. Ozsu and L. Liu (eds.), 2008. [pdf]
 A. Woss, V.J. Tsotras, 'Experimental Evaluation of
Query Processing Techniques over Multiversion XML Documents',
Proceedings of 12th International Workshop on the Web and Databases
(WebDB 2009). Providence, Rhode Island - June 28, 2009. [pdf]
 W. Xu, 'Structured Queries Over XML Documents with
Branched Versioning', M.Sc. thesis, UC Riverside, 2009. [pdf]
 M. Salloum, V.J. Tsotras, 'Efficient and Scalable
Sequence-Based XML Filtering System', Proceedings of 12th
International Workshop on the Web and Databases (WebDB 2009).
Providence, Rhode Island - June 28, 2009. [pdf]
 Abhishek Mitra, Marcos R. Vieira, Petko Bakalov,
Vassilis J. Tsotras, Walid A. Najjar: Boosting XML filtering through
a scalable FPGA-based architecture. CIDR, Fourth Biennial Conference
on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009.
 R. Moussali, M. Salloum, W. Najjar, V.J. Tsotras,
'Custom stack generation for accelerated XML query matching on
FPGAs', submitted for publication.
 Petko Bakalov, Erik Hoel, Sudhakar Menon and
Vassilis J. Tsotras, 'Versioning of Network Models in a Multiuser
Environment', Proceedings of 11th International Symposium on Spatial
and Temporal Databases (SSTD), July 8-10, 2009, Aalborg, Denmark.
 Petko Bakalov, Erik G. Hoel, Wee-Liang Heng,
Vassilis J. Tsotras, 'Editing and versioning dynamic network models',
Demo presented at the 16th ACM International Symposium on Advances in
Geographic Information Systems, ACM-GIS 2008, November 5-7, 2008.
 Michael J. Carey, Vassilis J. Tsotras: Letter from
the Special Issue Editors, 'XQuery
Processing: Practice and Experience', IEEE Data Eng. Bull. 31(4):
 Wenyu Huo, Vassilis J. Tsotras, 'Efficient Management and Querying
of Transaction-time Databases under Branched Schema Evolution', UCR TR, July 2010.
 Petko Bakalov, Erik Hoel, Sudhakar Menon, and Vassilis J. Tsotras,
'Versioning and Editing of Network Models in a Multiuser Environment', submitted for publication.
 Mariam Salloum and Vassilis J. Tsotras, 'A Unified Approach for Structural
XML Processing', submitted for publication.
 Marcos R. Vieira, Humberto L. Razente, Maria C. N. Barioni, Marios Hadjieleftheriou,
Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras, 'On Query Result Diversification',
submitted for publication.
 Md. Mahbub Hasan, Vassilis J. Tsotras, 'On XML Query Result Diversity', under submission.
 Wenyu Huo, Vassilis J. Tsotras, 'Temporal Top-k Search in Social
Tagging Sites Using Multiple Social Networks',
In Proceedings of DASFAA, 2010, pp. 498-504.
 Theodoros Lappas, Marcos Vieira, Dimitris Gunopulos, Vassilis J. Tsotras, 'Searching for
Documents on Influential Events through Space and Time', submitted for publication.
 Marios Hadjieleftheriou, Vassilis J. Tsotras: Letter from
the Special Issue Editors, 'Result Diversity', IEEE Data Eng. Bull. 32(4):