Flexible Parsing (FP)

The Optimal Parsing for Dictionary Based Compression Yossi Matias (Tel-Aviv University and Bell Labs), Nasir Rajpoot (University of Warwick and Yale University), and Cenk Sahinalp (University of Warwick and UPenn Center for BioInformatics)

  • Introduction (LZW-FP)
  • Alternative FP (FPA)
  • Implementation
  • Experimental Results
  • Experimental Data Sets
  • Download
  • Documentation
  • Benchmark Suite
  • Introduction (LZW-FP)

    Flexible parsing (FP) is a proposed extension for the common dictionary based compression schemes. The basic idea behind this algorithm is that we look one step ahead for the longest phrase in the dictionary instead of trying to find the longest possible phrase at hand. The FP is optimal for all dictionary schemes that satisfy the prefix property (which include most practical schemes such as Lempel-Ziv variants LZ-77, the UNIX gzip utility, and LZ-78, the UNIX compress utility.) LZW-FP is an implementation of the flexible parsing which employs the dictionary construction similar to the one in the popular LZW scheme. Please see the documents at the end of this page for a detailed description of the LZW-FP algorithm.

    
    
    
    
    
    Alternative FP (FPA)

    FPA uses the same flexible parsing scheme as described above. It differs from LZW-FP in that it has its own dictionary construction scheme which works as follows. After the parsing, which finds out the longest extending phrase in the next step, it adds to the dictionary the longest matching string and the character just after it (all of which forms a new phrase for the dictionary.)

    
    
    
    
    
    Implementation

    In our LZW-FP implementation, we limited the dictionary size to 216 (64K) phrases, and reset it when it was full (as is the case with compress, except when it finds from the compression performance so far that there is no need for resetting the dictionary.) Two versions of FPA were implemented, one with the dictionary size same as that of LZW-FP and the other one with the dictionary having 224 phrases (we name it FPA-24.)

    The preliminary implementation of flexible parsing in both LZW-FP and FPA is quite plain and can be regarded as a semi-quadratic one. Work is being carried out on their optimisation by incorporating more efficient data structures, and we hope to get significant improvement in terms of speed.

    
    
    
    
    Experimental Results

    The current implementation of LZW-FP and FPA has shown quite promising results over both compress and gzip. Following is a tabular presentation of the comparative compression performances of gzip, compress, LZW-FP, FPA, and FPA-24 when applied to four data sets:

  • DNA and protein sequences provided by the Center for BioInformatics, University of Pennsylvenia and CT scan and MR images provided by the Guys' and St. Thomas Hospital, London,
  • IID binary data obtained via UNIX C drand48(), a uniform distribution random number generator,
  • The new Canterbury corpus large data set, and
  • Some files from the Calgary corpus.
  • Table 1: Biological sequences and medical images
    
    Table 2: IID pseudorandom binary sequences
    
    Table 3: Canterbury corpus (large data set)
    
    Table 4: Calgary corpus
    

    From these tables, we can say that LZW-FP exhibits nice asymptotic properties for large size (roughly > 1MB) data files. FPA outperforms LZW-FP in almost every case, primarily because of efficient dictionary construction apart from the flexible parsing. FPA-24 outperforms both gzip and compress for all the files with size greater than 1MB.

    Following is a graphical presentation of the fact that LZW-FP, FPA, and FPA-24 outperform both gzip and compress utilities for pseudorandom binary sequences, specially for larger data files. Furthermore, all of these FP-variants approach the entropy much faster than the gzip and compress do.

    Figure 1: Compression performance for pseudorandom binary sequence
    
    Figure 2: Approaching the entropy for pseudorandom binary sequence
    
    Experimental Data Sets
  • Biological and medical data (4 files in gzipped tar format, 1964K)
  • Files from the Calgary corpus (19 files in gzipped tar format, 1047K)
  • Files from the Canterbury corpus, large set (3 files in gzipped tar format, 3183K)
  • Binary IID data (15 files in gzipped tar format, 932K)
  • Download

    Please feel free to download the preliminary versions of LZW-FP and FPA packages (compiled for Sun Solaris and Irix) and do let us know how you found their efficiency and compression performance as compared to the standard UNIX compress and gzip utilities.

    (Disclaimer! Although these software have been found to be free of bugs, please make sure that you have backups before compressing your data. We do not take responsibility for any loss of data arising from the use of these packages.)

    
    
    
    
    Documentation

    [ Introduction | FPA | Implementation | Results | Data | Download | Documentation | Benchmark ]

    Listed by Mathtools.net


    Copyright © 1998. All rights reserved.