Reputation: 7017

Sorted byte arrays with unique values - maximal possible compression

I have byte arrays with following constraints:

Length between 1 and 256
Length median about 128, but I have to verify this on larger dataset
Values are sorted ascending
Values are unique

I am looking for algorithm for maximal possible compression of this data. Maximal uncompressed size for array if it is full is 256B. For median it is 128B.

For now best compression I know is to use bit-field to store if byte is in array or not, and I can omit trailing zeros. So for one array i will use ceiling("max value" / 8) B. For full array (or array containing 248) this will be 32B.

This will reduce size in general, but it is bad for sparse arrays with length < 32. I can use flag to store data compressed or uncompressed if it turns out that uncompressed array is smaller than compressed.

Is there any other trick/optimization/compression i can use to reduce size even more?

One short example of data to eliminate misunderstandings, please note that this array is shorter than expected array in data:

{ 0, 1, 5, 7, 88, 105, 233, 234, 235, 255 }

Upvotes: 1

Answers (2)

Mark Adler

Reputation: 112502

The best you could do would be arithmetic compression, using a model that has the probability of the next symbol uniformly distributed over the remaining unused byte values. Then the number of bits used for each symbol will be approximately the log base 2 of the number of byte values remaining. That adds up to about 68 bits.

At some cost in compression you could simulate this with a fixed number of bits for the next symbol, which is the least number of bits to represent the number of remaining symbols. For the example you give, you would use eight bits for each of the first seven values, and five bits for each of the remaining values. So 71 bits total. Still a good bit less than 256 bits.

Depending on how you know the length of your sequence or the length of the compressed data, you may need to reserve a state to indicate the end of the sequence. So you would need to add a bit if you happened to be right at a power of two for the next entry in the sequence. The first entry would then need to be nine bits, with the first bit being 1 indicating an empty sequence.

As you suggest, you could have an initial bit before that that determines whether this coding or a bit map was used. When coding you would pick the smallest.

Upvotes: 0

Juraj Blaho

Reputation: 13471

One option may be to:

Calculate differences between consecutive values. These differences are usually small positive numbers.
Encode the differences using Golomb, Huffman or arithmetic coding, where small numbers are encoded with less bits than large numbers.

Upvotes: 3

Sorted byte arrays with unique values - maximal possible compression

Answers (2)

Related Questions