Compression of texts via greedy off-line textual substitution refers to the possibility to identify a particularly redundant word in the text and to replace all of its non-overlapped occurrences (but one) with pointers, in order to get the highest possible compression; the process is then iterated on the compressed text until a word capable of producing further compression can no longer be found.
Off-line is a tool written in C++/STL that implements the above strategy. Experimental results, implementation details and open problems are discussed in the papers below. In the following table we compare Off-Line3 with other compression programs on families of sequences of the yeast. The parameter k is the number of upstream sequences in each family. Individual sequence length is 800 bps except in the last two rows, where it is 2,000. The alphabet consists of about 50 symbols.
Family | Total size | Huffman | LZ-78 | LZ-77 | BWT | ||
(bytes) | k | Pack | Compress | GZip | BZip2 | Off-Line3 | |
Spor_EarlyII.fasta | 25,008 | 29 | 7,996 | 7,875 | 8,008 | 7,300 | 7,119 |
Spor_EarlyI.fasta | 31,039 | 36 | 9,937 | 9,646 | 9,862 | 9,045 | 8,697 |
Helden_GCN.fasta | 32,871 | 38 | 10,590 | 10,223 | 10,379 | 9,530 | 9,301 |
Spor_Middle.fasta | 54,325 | 63 | 17,295 | 16,395 | 16,961 | 15,490 | 14,778 |
Helden_All.fasta | 112,507 | 130 | 36,172 | 33,440 | 33,829 | 31,793 | 29,758 |
Spor_All.fasta | 222,453 | 258 | 70,755 | 63,939 | 68,136 | 61,674 | 54,317 |
Spor_All_2x.fasta | 444,906 | 516 | 141,431 | 124,637 | 135,816 | 85,142 | 65,891 |
All_Up_400k.fasta | 399,615 | 191 | 121,700 | 115,029 | 115,023 | 112,363 | 106,722 |
All_Up_1M.fasta | 1,001,002 | 477 | 305,054 | 286,971 | 285,064 | 280,334 | 268,612 |
Download Off-line. You will need the library zlib to compile Off-line. I have successfully compiled the code under Solaris and Linux using GCC or EGCS. The software comes under GPL: use it at your own risk!
A.Apostolico, S.Lonardi, "Off-line Compression by Greedy Textual Substitution", Proceedings of the IEEE, vol.88, no.11, November 2000 . Also TR-99-043, Dept. of Computer Science, Purdue University. (procieee.ps.gz, 239,193 bytes; procieee.pdf, 294,145 bytes). BibTex entry
A.Apostolico, S.Lonardi, "Compression of Biological Sequences by Greedy Off-line Textual Substitution", powerpoint slides presented at the Data Compression Conference (DCC 2000), Snowbird, Utah, March 27 - March 30, 2000. (dcc20000.zip, 1,290,431 bytes).
A.Apostolico, S.Lonardi, "Compression of Biological Sequences by Greedy Off-line Textual Substitution", Proceeding of the IEEE Data Compression Conference (DCC 2000), pp.143-152, Snowbird, Utah, March 27 - March 30, 2000. Also TR-99-037, Dept. of Computer Science, Purdue University. (dcc00.ps.gz, 140,654 bytes; dcc00.pdf, 269,371 bytes). BibTex entry
A.Apostolico, S.Lonardi, "Some Theory and Practice of Greedy Off-Line Textual Substitution", Proceeding of the IEEE Data Compression Conference (DCC'98), pp.119-128, Snowbird, Utah, March 30 - April 1, 1998. (dcc98.ps.gz, 53,464 bytes; dcc98.ps.gz, 1,716,188 bytes). BibTex entry
A.Apostolico, S.Lonardi, "Some Theory and Practice of Greedy Off-Line Textual Substitution", slides presented at the Data Compression Conference (DCC'98), Snowbird, Utah, March 30 - April 1, 1998. (dcc98-slides.ps.gz, 183,910 bytes; dcc98-slides.pdf, 253,921 bytes)
S.Lonardi, "Off-Line Data Compression by Textual Substitution", unpublished manuscript. (tpd.ps.gz, 519,741 bytes; tpd.pdf, 1,055,677 bytes)
© Copyright Notice: The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
National Science Foundation (Grant CCR-9700276)
Purdue Research Foundation (Grant 690-1398-3145)