Luka
Luka

Reputation: 1801

Improve number compression algorithm?

I have many unique numbers, all positive and the order doesn't matter, 0 < num < 2^32.
Example: 23 56 24 26

The biggest, 56, needs 6 bits space. So, I need: 4*6 = 24 bits in total.

I do the following to save space:
I sort them first: 23 24 26 56 (because the order doesn't matter)
Now I get the difference of each from the previous: 23 1 2 30

The biggest, 30, needs 5 bits space.
After this I store all the numbers in 4*5 bits = 20 bits space.

Question: how to further improve this algorithm?

More information: Since requested, the numbers are mostly on the range of 2.000-4.000. Numbers less than 300 are pretty rare. Numbers more than 16.000 are pretty rare also. Generally speaking, all the numbers will be close. For example, they may be all in the 1.000-2.000 range or they may all be in the 16.000-20.000 range. The total number of numbers will be something in the range of 500-5.000.

Upvotes: 4

Views: 873

Answers (4)

High Performance Mark
High Performance Mark

Reputation: 78316

How many numbers do you have ? If your set covers the range [0..(2^32)-1] densely enough (you do the maths) then a 4GiB bitfield, where the n-th bit represents the presence, or absence, of the natural number n may be useful.

Upvotes: 2

Vikram Bhat
Vikram Bhat

Reputation: 6246

Your first step is good one to take because sorting reduces the differences to least. Here is a way to improve your algorithm:

  1. sort and calculate differences as you have done.
  2. Use Huffman coding on it.

Use of Huffman coding is more important then your step; I'll show you why:

consider the following data:

1 2 3 4 5 6 7 4294967295

where 4294967295 = 2^32-1. Using your algorithm:

1 1 1 1 1 1 1 4294967288

total bits needed is still 32*8

Using Huffman coding, the frequencies are:

1 => 7
4294967288 => 1

Huffman codes are 1 => 0 and 4294967288 => 1

Total bits needed = 7*1 + 1 = 8 bits

Huffman coding reduces size by 32*8/8 = 32 times

Upvotes: 4

max taldykin
max taldykin

Reputation: 12898

This problem is well known in database community as "Inverted index compression". You can google for some papers.

Following are some of the most common techniques:

  • Variable byte coding (VByte)
  • Simple9, Simple16
  • "Frame Of Reference" family of techniques
    • PForDelta
    • Adaptive Frame Of Reference (AFOR)
  • Rice-Golomb coding (often used as a part of other techniques)

VByte and Simple9/16 are easiest to implement, fast and have good compression ratio in practice.

Huffman coding is not very good for index compression because it is slow and differences are quite random in practice. (But it may be a good choice in your case.)

Upvotes: 4

Louis Hugues
Louis Hugues

Reputation: 596

If your numbers are not uniformly distributed, a better compression will be achieved by using frequencies of the numbers and affect less bits to most frequent ones. This is the idea behind huffman coding.

Upvotes: 0

Related Questions