The problem of characterizing and detecting over- or
under-represented words in sequences arises in many
applications and has been studied extensively in
Computational Molecular Biology. Most approaches to detecting
unusual word frequencies in sequences enumerate the words (up to a
certain length) more or less exhaustively and check each one
individually in terms of observed and expected frequencies,
variances, and scores of discrepancy and their significance.
We take instead the global approach of annotating a suffix trie or automaton of a sequence with some such values and scores, with the objective of using it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to warrant further and more accurate scrutiny.
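As a rough illustration of the idea of annotating a trie with observed counts, expected counts, and a discrepancy score, the following is a minimal sketch and not the implementation described in the publications below. It assumes a depth-bounded suffix trie built naively, an i.i.d. symbol model estimated from the sequence itself, and a crude binomial variance that ignores word self-overlap (an accurate variance would have to account for it). The names `Node`, `build_annotated_trie`, and `lookup` are hypothetical.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    count: int = 0          # observed occurrences of the word spelled by the path to this node
    expected: float = 0.0   # expected occurrences under the assumed i.i.d. model
    score: float = 0.0      # (observed - expected) / sqrt(approximate variance)
    children: dict = field(default_factory=dict)


def build_annotated_trie(seq, max_len):
    """Build a trie of all substrings of seq up to max_len and annotate every node."""
    n = len(seq)
    # Assumption: symbol probabilities are estimated from the sequence itself.
    probs = {c: seq.count(c) / n for c in set(seq)}

    # Insert every suffix, truncated to max_len symbols, counting visits per node.
    root = Node()
    for i in range(n):
        node = root
        for c in seq[i:i + max_len]:
            node = node.children.setdefault(c, Node())
            node.count += 1

    def annotate(node, depth, p_word):
        for c, child in node.children.items():
            p = p_word * probs[c]          # probability of the word at a fixed position
            m = depth + 1                  # length of the word at this child
            positions = n - m + 1          # number of possible start positions
            child.expected = positions * p
            # Crude approximation: binomial variance, ignoring overlaps between occurrences.
            var = positions * p * (1 - p)
            child.score = (child.count - child.expected) / sqrt(var) if var > 0 else 0.0
            annotate(child, m, p)

    annotate(root, 0, 1.0)
    return root


def lookup(root, word):
    """Return the node for word; raises KeyError if word does not occur in the sequence."""
    node = root
    for c in word:
        node = node.children[c]
    return node
```

For example, `lookup(build_annotated_trie("abababab", 2), "ab")` returns a node whose `count` is 4 and whose `score` is positive, flagging "ab" as over-represented relative to the assumed model. A single traversal of the annotated trie can then collect every word whose score exceeds a chosen threshold, which is the sense in which the structure acts as a collective detector or filter.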
A.Apostolico, F.Gong, S.Lonardi, "Verbumculus and the Discovery of Unusual Words", Journal of Computer Science and Technology (special issue in Bioinformatics), vol.19, no.1, pp.22-41, 2004.
A.Apostolico, M.E.Bock, S.Lonardi, "Monotony of surprise and large-scale quest for unusual words", to appear in Journal of Computational Biology, July 2003.
A.Apostolico, M.E.Bock, S.Lonardi, "Monotony of surprise and large-scale quest for unusual words", Proceedings of ACM RECOMB, pp.22-31, Washington, DC, 2002.
S.Lonardi, "Global Detectors of Unusual Words: Design, Implementation, and Applications to Pattern Discovery in Biosequences", Ph.D. Thesis, Dept. of Computer Science, Purdue University, August 2001.
A.Apostolico, M.E.Bock, S.Lonardi, X.Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol.7, no.1/2, 2000. (jcb.ps.gz, 229,967 bytes; jcb.pdf, 655,877 bytes).
A.Apostolico, M.E.Bock, S.Lonardi, "Linear Global Detectors of Redundant and Rare Substrings", Proceedings of the IEEE Data Compression Conference (DCC'99), pp.168-177, Snowbird, Utah, March 29-31, 1999. (dcc99.ps.gz, 100,668 bytes; dcc99.pdf, 332,063 bytes).
© Copyright Notice: The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
National Science Foundation (Grant CCR-9700276)
Purdue Research Foundation (Grant 690-1398-3145)