gray_fox

Reputation: 177

Is there a way to store gzip's dictionary from a file?

I've been doing some research on compression-based text classification, and I'm trying to find a way to store the dictionary built by the encoder on a training file so that it can be applied 'statically' to a test file. Is this possible using UNIX's gzip utility?

For example, I have been using two 'class' files, sport.txt and atheism.txt. I want to compress each of these files and store the dictionaries they used. Then I want to take a test file (which is unlabelled and could be either atheism or sport), compress test.txt with each prebuilt dictionary, and analyse how well it compresses under that dictionary/model.

Thanks

Upvotes: 10

Views: 4354

Answers (2)

Cyril

Reputation: 66

As of 2023, you can experiment with zstd easily. Unlike gzip, zstd can build a compression dictionary and provides methods to generate and store it.

Here is an example with the Python binding python-zstandard: https://python-zstandard.readthedocs.io/

import zstandard

ENCODING = "UTF-8"

# Build a raw-content dictionary from the training text of one class.
training_data = "my training text"
dictionary = zstandard.ZstdCompressionDict(
    training_data.encode(ENCODING),
    dict_type=zstandard.DICT_TYPE_RAWCONTENT,
)
compressor = zstandard.ZstdCompressor(dict_data=dictionary)

# Compress the test text with that dictionary and record the compressed size.
test_data = "my test text"
compressed = compressor.compress(test_data.encode(ENCODING))
compressed_length = len(compressed)

The ftcc project implements this approach end to end and provides accuracy benchmarks.

Disclaimer: I am the author of the ftcc project.

Upvotes: 2

Mark Adler

Reputation: 112404

Deflate encoders, as in gzip and zlib, do not "build" a dictionary. They simply use the previous 32K bytes as a source for potential matches to the string of bytes starting at the current position. Those last 32K bytes are called the "dictionary", but the name is perhaps misleading.
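To see the 32K window in action, here is a small sketch using Python's standard zlib module. It compresses a block of random bytes followed by an exact copy of itself, once with the copy inside the window and once outside it. The helper name size_when_repeated and the block sizes are made up for illustration.

```python
import random
import zlib


def size_when_repeated(block_len: int) -> int:
    """Compress a random block followed by an exact copy of itself."""
    rng = random.Random(0)  # fixed seed so the run is repeatable
    block = bytes(rng.getrandbits(8) for _ in range(block_len))
    return len(zlib.compress(block + block, 9))


# The copy starts 20,000 bytes back: within reach of the 32K window.
near = size_when_repeated(20_000)
# The copy starts 40,000 bytes back: too far away to be used for matches.
far = size_when_repeated(40_000)

print("repeat inside window: ", near, "bytes")
print("repeat outside window:", far, "bytes")
```

In the first case the second copy should cost almost nothing, while in the second case the output should be roughly as large as the input, since random bytes do not compress and the earlier copy is out of range.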

You can use zlib to experiment with preset dictionaries. See the deflateSetDictionary() and inflateSetDictionary() functions. In that case, zlib compression is primed with a "dictionary" of 32K bytes that effectively precede the first byte being compressed as a source for matches, but the dictionary itself is not compressed. The priming can only improve the compression of the first 32K bytes. After that, the preset dictionary is too far back to provide matches.
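If you would rather not write C, Python's standard zlib module accepts a preset dictionary through the zdict parameter of compressobj() and decompressobj(). Below is a minimal sketch of the classification idea from the question, assuming tiny made-up strings in place of sport.txt and atheism.txt. The helper deflate_with_dict is a hypothetical name, not part of zlib.

```python
import zlib


def deflate_with_dict(data: bytes, preset: bytes) -> bytes:
    """Compress data with zlib, primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=preset)
    return comp.compress(data) + comp.flush()


# Toy stand-ins for sport.txt and atheism.txt (made up for illustration).
corpora = {
    "sport": b"the team won the match after scoring a late goal in the final",
    "atheism": b"the argument about belief in god and religion and faith is old",
}
test_text = b"the team scoring a late goal won the final match"

# Compress the test text once per class and pick the smallest output.
sizes = {
    label: len(deflate_with_dict(test_text, corpus))
    for label, corpus in corpora.items()
}
predicted = min(sizes, key=sizes.get)
print(sizes, predicted)

# Decompression needs the same dictionary that was used to compress.
decomp = zlib.decompressobj(zdict=corpora["sport"])
restored = decomp.decompress(deflate_with_dict(test_text, corpora["sport"]))
restored += decomp.flush()
```

Keep in mind the limit described above: only the last 32K bytes of each training file can serve as the dictionary, and only the first 32K bytes of the test file can benefit from it.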

gzip provides no support for preset dictionaries.

Upvotes: 13
