Flexible Parsing (FP)

The Optimal Parsing for Dictionary Based Compression

Yossi Matias
(Tel-Aviv University and Bell Labs),
Nasir Rajpoot
(University of Warwick and Yale University), and
Cenk Sahinalp
(University of Warwick and UPenn Center for BioInformatics)

Introduction (LZW-FP)

Alternative FP (FPA)

Implementation

Experimental Results

Experimental Data Sets

Download

Documentation

Benchmark Suite

Introduction (LZW-FP)

Flexible parsing (FP) is a proposed extension to common dictionary-based compression schemes. Instead of greedily taking the longest dictionary phrase that matches at the current position, FP looks one step ahead: it chooses the current phrase so that this phrase, together with the longest phrase that can follow it, covers as much of the input as possible. FP is optimal for all dictionary schemes that satisfy the prefix property (every prefix of a dictionary phrase is itself in the dictionary). This includes most practical schemes, such as the Lempel-Ziv variants LZ-77 (used by the UNIX gzip utility) and LZ-78 (used by the UNIX compress utility). LZW-FP is an implementation of flexible parsing that uses a dictionary construction similar to the one in the popular LZW scheme. Please see the documents at the end of this page for a detailed description of the LZW-FP algorithm.
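The difference between greedy parsing and the one-step lookahead can be illustrated with a small sketch. It is not the LZW-FP code: it assumes a fixed toy dictionary that satisfies the prefix property (LZW-FP builds its dictionary dynamically), and the names longest_match, greedy_parse, flexible_parse, and TOY_DICTIONARY are hypothetical. Breaking ties in favour of the longer current phrase is an arbitrary choice made here.

```python
def longest_match(text, pos, dictionary):
    """Length of the longest dictionary phrase starting at text[pos]."""
    length = 0
    # With the prefix property, we can stop at the first prefix that is missing.
    while pos + length < len(text) and text[pos:pos + length + 1] in dictionary:
        length += 1
    return length


def greedy_parse(text, dictionary):
    """Always take the longest phrase available at the current position."""
    phrases, pos = [], 0
    while pos < len(text):
        n = longest_match(text, pos, dictionary)
        if n == 0:
            raise ValueError(f"no dictionary phrase matches at position {pos}")
        phrases.append(text[pos:pos + n])
        pos += n
    return phrases


def flexible_parse(text, dictionary):
    """Pick the phrase whose follow-up longest match reaches farthest."""
    phrases, pos = [], 0
    while pos < len(text):
        max_len = longest_match(text, pos, dictionary)
        if max_len == 0:
            raise ValueError(f"no dictionary phrase matches at position {pos}")
        best_len, best_reach = 0, -1
        # Every prefix of the longest match is a valid phrase (prefix property).
        for k in range(1, max_len + 1):
            reach = pos + k + longest_match(text, pos + k, dictionary)
            if reach >= best_reach:  # ties: prefer the longer current phrase
                best_len, best_reach = k, reach
        phrases.append(text[pos:pos + best_len])
        pos += best_len
    return phrases


# Toy dictionary (an assumption for illustration); it has the prefix property.
TOY_DICTIONARY = {"a", "b", "c", "d", "e", "ab", "bc", "bcd", "bcde"}

print(greedy_parse("abcde", TOY_DICTIONARY))    # ['ab', 'c', 'd', 'e']
print(flexible_parse("abcde", TOY_DICTIONARY))  # ['a', 'bcde']
```

In this toy case the greedy parser commits to "ab" and is then forced into three single-character phrases, while the lookahead gives up one character now in order to use "bcde" next.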

Alternative FP (FPA)

FPA uses the same flexible parsing scheme as described above but differs from LZW-FP in that it has its own dictionary construction scheme. At each step, once the parser has found the longest extending phrase, FPA adds to the dictionary the longest matching string followed by the character immediately after it; together these form the new dictionary phrase.
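One possible reading of this construction is sketched below. It is an encoder-side illustration only, based on the paragraph above rather than on the FPA source: the function name fpa_style_parse, the tie-breaking rule, and the choice to update the dictionary after the phrase has been selected are all assumptions, and the sketch does not address how a decoder would rebuild the same dictionary.

```python
def longest_match(text, pos, dictionary):
    """Length of the longest dictionary phrase starting at text[pos]."""
    length = 0
    while pos + length < len(text) and text[pos:pos + length + 1] in dictionary:
        length += 1
    return length


def fpa_style_parse(text, alphabet):
    """Flexible parsing with an FPA-like dictionary update (hypothetical sketch)."""
    dictionary = set(alphabet)  # start from all single characters
    phrases, pos = [], 0
    while pos < len(text):
        max_len = longest_match(text, pos, dictionary)
        if max_len == 0:
            raise ValueError(f"symbol at position {pos} is not in the alphabet")

        # Flexible parsing step: choose the phrase whose successor reaches farthest.
        best_len, best_reach = 0, -1
        for k in range(1, max_len + 1):
            reach = pos + k + longest_match(text, pos + k, dictionary)
            if reach >= best_reach:
                best_len, best_reach = k, reach

        # Dictionary step: longest match at this position plus the next character.
        end = pos + max_len
        if end < len(text):
            dictionary.add(text[pos:end + 1])

        phrases.append(text[pos:pos + best_len])
        pos += best_len
    return phrases, dictionary


phrases, dictionary = fpa_style_parse("aaaaaa", alphabet="a")
print(phrases)             # ['a', 'aa', 'aaa']
print(sorted(dictionary))  # ['a', 'aa', 'aaa']
```

Because the new phrase extends a string that is already in the dictionary by exactly one character, the prefix property is preserved after every update.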

Implementation

In our LZW-FP implementation, we limited the dictionary to 2¹⁶ (64K) phrases and reset it whenever it became full. (compress does the same, except that it skips the reset when the compression performance so far shows that none is needed.) We implemented two versions of FPA: one with the same dictionary size as LZW-FP, and one with a dictionary of 2²⁴ phrases, which we call FPA-24.
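The reset-when-full policy can be sketched as a small bounded dictionary. This is an illustration under stated assumptions, not the packaged implementation: the class name BoundedDictionary, the reset_count field, and the decision to insert the new phrase right after a reset are hypothetical. A limit of 2¹⁶ phrases means every code fits in 16 bits; FPA-24 would correspond to max_phrases = 1 << 24.

```python
class BoundedDictionary:
    """Phrase-to-code table that is reset to the bare alphabet when full."""

    def __init__(self, alphabet, max_phrases=1 << 16):
        self.alphabet = list(alphabet)
        self.max_phrases = max_phrases
        self.reset_count = 0
        self.codes = {}
        self._load_alphabet()

    def _load_alphabet(self):
        # Single symbols always occupy the first codes.
        self.codes = {symbol: i for i, symbol in enumerate(self.alphabet)}

    def add(self, phrase):
        """Insert a phrase, resetting first if the table is already full."""
        if phrase in self.codes:
            return
        if len(self.codes) >= self.max_phrases:
            self._load_alphabet()
            self.reset_count += 1
        # Assumption: the phrase that triggered the reset is kept afterwards.
        self.codes[phrase] = len(self.codes)

    def __contains__(self, phrase):
        return phrase in self.codes

    def __len__(self):
        return len(self.codes)


# Tiny limit so that the reset is visible; the text above uses 1 << 16.
table = BoundedDictionary(alphabet="ab", max_phrases=4)
for phrase in ["aa", "ab", "ba"]:
    table.add(phrase)
print(len(table), table.reset_count)  # 3 1
```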

The preliminary implementation of flexible parsing in both LZW-FP and FPA is quite plain and can be regarded as semi-quadratic. We are working on optimising it with more efficient data structures and hope to achieve a significant improvement in speed.

Experimental Results

The current implementations of LZW-FP and FPA have shown quite promising results compared with both compress and gzip. The tables below compare the compression performance of gzip, compress, LZW-FP, FPA, and FPA-24 on four data sets:

DNA and protein sequences provided by the Center for BioInformatics, University of Pennsylvania, and CT scan and MR images provided by Guy's and St. Thomas' Hospital, London,

IID binary data obtained via UNIX C drand48(), a uniform distribution random number generator,

The new Canterbury corpus large data set, and

Some files from the Calgary corpus.

Table 1: Biological sequences and medical images

Table 2: IID pseudorandom binary sequences

Table 3: Canterbury corpus (large data set)

Table 4: Calgary corpus

From these tables, we can see that LZW-FP exhibits good asymptotic behaviour on large data files (roughly 1MB and above). FPA outperforms LZW-FP in almost every case, primarily because it combines flexible parsing with a more efficient dictionary construction. FPA-24 outperforms both gzip and compress on all files larger than 1MB.

The figures below show that LZW-FP, FPA, and FPA-24 outperform both the gzip and compress utilities on pseudorandom binary sequences, especially for larger data files. Furthermore, all of these FP variants approach the entropy much faster than gzip and compress do.

Figure 1: Compression performance for pseudorandom binary sequence

Figure 2: Approaching the entropy for pseudorandom binary sequence
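For readers who want to reproduce this kind of comparison, the sketch below computes the entropy of an IID binary source and the rate achieved by a compressor on a sample from it. Several things here are assumptions made for illustration: Python's random module stands in for drand48(), zlib (the DEFLATE method also used by gzip) stands in for the gzip utility, each source symbol is stored as one ASCII character, and the probability 0.1 and the length 20000 are arbitrary. The function names are hypothetical.

```python
import math
import random
import zlib


def binary_entropy(p):
    """Entropy in bits per symbol of an IID binary source with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def iid_bits(n, p, seed=0):
    """n IID symbols, stored one per byte as ASCII '0' or '1' (assumption)."""
    rng = random.Random(seed)
    return bytes(ord("1") if rng.random() < p else ord("0") for _ in range(n))


def bits_per_symbol(data, compressed):
    """Compressed size in bits divided by the number of source symbols."""
    return 8 * len(compressed) / len(data)


data = iid_bits(n=20000, p=0.1)
rate = bits_per_symbol(data, zlib.compress(data, 9))
print(f"entropy  : {binary_entropy(0.1):.4f} bits/symbol")
print(f"zlib rate: {rate:.4f} bits/symbol")
```

Repeating the measurement for increasing n shows how quickly a given compressor's rate approaches the entropy, which is the quantity plotted in Figure 2.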

Experimental Data Sets

Biological and medical data (4 files in gzipped tar format, 1964K)
Files from the Calgary corpus (19 files in gzipped tar format, 1047K)
Files from the Canterbury corpus, large set (3 files in gzipped tar format, 3183K)
Binary IID data (15 files in gzipped tar format, 932K)

Download

Please feel free to download the preliminary versions of the LZW-FP and FPA packages (compiled for Sun Solaris and Irix), and do let us know how their efficiency and compression performance compare with those of the standard UNIX compress and gzip utilities.

LZW-FP for Sun Solaris (compressed tar format, 71K)
FPA for Sun Solaris (compressed tar format, 70K)
LZW-FP-Bin for Sun Solaris for binary data only (compressed tar format, 21K)

LZW-FP for Irix (compressed tar format, 15K)
FPA for Irix (compressed tar format, 15K)
LZW-FP-Bin for Irix for binary data only (compressed tar format, 23K)

(Disclaimer: although this software has been found to be free of bugs, please make sure that you have backups before compressing your data. We do not take responsibility for any loss of data arising from the use of these packages.)

Documentation

Y. Matias, N. Rajpoot, and S. C. Sahinalp, The effect of flexible parsing for dynamic dictionary based data compression (compressed postscript format, 120K), Proceedings of the IEEE Data Compression Conference (DCC'99), March 1999.

Y. Matias and S. C. Sahinalp, On the optimality of parsing in dynamic dictionary based data compression, preliminary version (compressed postscript format, 94K); a short summary appeared in SODA'99.

Y. Matias, N. Rajpoot, and S. C. Sahinalp, Implementation and experimental evaluation of flexible parsing for dynamic dictionary based data compression (compressed postscript format, 82K), Second Workshop on Algorithm Engineering (WAE'98), August 1998.
