esafwan

Reputation: 18029

Using a preset deflate dictionary to reduce compressed archive file size

I have a requirement where text files are sent from one location to another. Both locations are under our control. The nature of the content and the words that can appear in it are mostly the same, which means that if I keep the deflate dictionary at both locations, there is no need to send it with each file.

I have been reading about this for the past week and experimenting with some available code, such as this and this.

However, I am still in the dark.

A few questions I still have:

  1. Can we generate and use a custom deflate dictionary from a preset list of words?
  2. Can we send a file without the deflate dictionary and use a local one?
  3. If not gzip, is there any other compression library that can be used for this purpose?

Some references I stumbled upon so far:

  1. https://medium.com/iecse-hashtag/huffman-coding-compression-basics-in-python-6653cdb4c476
  2. https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/
  3. https://www.euccas.me/zlib/#zlib_optimize_cloudflare_dict

Upvotes: 4

Views: 1420

Answers (2)

esafwan

Reputation: 18029

Below are the specific answers I found, along with example code.

1. Can we generate and use a custom deflate dictionary from a preset list of words?

Yes, this can be done. A quick example in Python:

import zlib

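# Data to compress (also reused as the dictionary in this toy example)
hello = b'hello'

# Compress as a raw deflate stream (negative wbits) with a preset dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()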

2. Can we send a file without the deflate dictionary and use a local one?

Yes, you can send just the compressed data without the dictionary. In the example above, the compressed data is in compress_data. To decompress it, however, the receiver needs the same zdict value that was passed during compression. Example of how it is decompressed:

hello = b'hello'  # must be the same bytes passed as zdict during compression
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = do.decompress(compress_data)

A full example, with and without the dictionary:

import zlib

#Data for compression
hello = b'hello'

#Compression with dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()

#Compression without dictionary
co_nodict = zlib.compressobj(wbits=-zlib.MAX_WBITS)
compress_data_nodict = co_nodict.compress(hello) + co_nodict.flush()

#De-compression with dictionary
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = do.decompress(compress_data)

#print compressed output when dict used
print(compress_data)

#print compressed output when dict not used
print(compress_data_nodict)

#print decompressed output when dict used
print(data)

The code above does not work directly with Unicode strings, because zlib operates on bytes. A str has to be encoded to bytes first, as below:

import zlib

#Data for compression
unicode_data = 'റെക്കോർഡ്'
hello = unicode_data.encode('utf-16be')

#Compression with dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()
...
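To round out the Unicode case, here is a minimal sketch of the full round trip, including the decode step on the receiving side. It assumes UTF-8 rather than UTF-16BE (any encoding should do, as long as both sides agree on it), and the names raw, compressed and restored are only for illustration. As in the examples above, the data itself is reused as the dictionary.

```python
import zlib

# Text to send; it must be encoded to bytes before compression
text = 'റെക്കോർഡ്'
raw = text.encode('utf-8')  # assumption: both sides agree on UTF-8

# Toy dictionary: the same bytes, as in the earlier examples
zdict = raw

# Sender: raw deflate stream with a preset dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
compressed = co.compress(raw) + co.flush()

# Receiver: same dictionary, then decode the bytes back to str
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
restored = (do.decompress(compressed) + do.flush()).decode('utf-8')

print(restored == text)
```

The dictionary has to be bytes as well, so it should be encoded with the same encoding as the data it is meant to match.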

References for a JS-based approach:

  1. How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?
  2. Compression of data with dictionary using zlib in node.js

Upvotes: 4

Mark Adler

Reputation: 112394

The zlib library supports dictionaries with the zlib (not gzip) format. See deflateSetDictionary() and inflateSetDictionary().

There is nothing special about the construction of a dictionary. All it is is 32K bytes of strings that you believe will occur often in the data you are compressing. You should put the most common strings at the end of the 32K.
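As a rough illustration of the points above, the following Python sketch builds a dictionary from sample texts, places the most frequent words at the end, and keeps at most the last 32K bytes. The helper build_dictionary and the sample strings are hypothetical, and a word-level dictionary is only one simple way to pick "strings that occur often". The default wbits in Python's zlib module produces the zlib format, which is the format mentioned here.

```python
import zlib
from collections import Counter

def build_dictionary(samples, max_size=32 * 1024):
    """Hypothetical helper: join the words seen in samples, most common last."""
    counts = Counter(word for sample in samples for word in sample.split())
    # most_common() lists frequent words first; reverse so they land at the end
    ordered = [word for word, _ in reversed(counts.most_common())]
    blob = b" ".join(ordered)
    # Keep only the tail, so the most common words survive truncation
    return blob[-max_size:]

# Made-up sample data standing in for previously seen files
samples = [
    b"connection timed out while connecting to server",
    b"connection refused by server",
    b"connection closed by server",
]
zdict = build_dictionary(samples)

message = b"connection timed out, server closed connection"

# zlib format (default wbits) with the preset dictionary
co = zlib.compressobj(zdict=zdict)
with_dict = co.compress(message) + co.flush()

# Same message without a dictionary, for comparison
without_dict = zlib.compress(message)

# The receiver must supply the identical dictionary
do = zlib.decompressobj(zdict=zdict)
restored = do.decompress(with_dict) + do.flush()

print(len(with_dict), len(without_dict), restored == message)
```

Comparing the two printed lengths on your own data is a quick way to check whether a given dictionary actually helps. Longer phrases taken from real files will usually give better matches than single words.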

Upvotes: 4
