daLime
daLime

Reputation: 47

Compressing a list of low bit integers (2, 3, 4 bits)

I have a relatively long list of low bit integers by size 16*10^6.

The sparsity level of this list is about 40%, so applying any sparse matrix does not work.

The first step I tried Run-length encoding, which does not work here, there are not so many repeating values.

Moreover, I need an algorithm with fast decoding, that is quite crucial, encoding can be as long as it required.

Huffman and Zstandard on average lead to x3 compression, which is good. However, Huffman with such length does not meet time decoding criteria.

While Zstandard is good, I was thinking if it is possible to do something better based on the fact that there are only 16 unique values (considering 4 bits). Also, I have no idea what Zstandard does inside and how it works.

It is not necessary to have a very fast implementation, just a proof of concept would suffice.

For that, I looked into dictionary coding but most of the algorithms assume int32 or FP32 data formats, so the overhead for the code-book and indexing beats the purpose.

Also, while there some solutions for sorted integers list in my case the list is not sorted.

I also looked into how the BMP format works, but there is no clear evidence it helps to compress the data. So, any pointers would be appreciated.

The main requirements are that the data is low bit and fast decoding is important.

Upvotes: 0

Views: 57

Answers (1)

Mark Adler
Mark Adler

Reputation: 112502

If you put each integer in one byte, then the usual compressors like gzip, zstd, xz, will do entropy coding to take advantage of any skewed distribution of the 16 possible values, to get you potentially better than four bits per value.

If you're not getting better than four bits per value, then as the comments suggest, simple store two values in each byte of output. In order to distinguish an even from an odd number of input values, always end the stream with one more zero value, and pad out the last byte with one bits if needed. Then if the last bits are four ones, there was an even number of values so drop the last byte. Otherwise the last four bits are zeros and there was an odd number of bytes, so ignore those last four zeros.

Upvotes: 1

Related Questions