Reputation: 1801
I have many unique numbers, all positive and the order doesn't matter, 0 < num < 2^32
.
Example: 23 56 24 26
The biggest, 56
, needs 6 bits
space. So, I need: 4*6 = 24 bits
in total.
I do the following to save space:
I sort them first: 23 24 26 56
(because the order doesn't matter)
Now I get the difference of each from the previous: 23 1 2 30
The biggest, 30
, needs 5 bits
space.
After this I store all the numbers in 4*5 bits = 20 bits
space.
Question: how to further improve this algorithm?
More information: Since requested, the numbers are mostly on the range of 2.000-4.000
. Numbers less than 300
are pretty rare. Numbers more than 16.000
are pretty rare also. Generally speaking, all the numbers will be close. For example, they may be all in the 1.000-2.000
range or they may all be in the 16.000-20.000
range. The total number of numbers will be something in the range of 500-5.000
.
Upvotes: 4
Views: 873
Reputation: 78316
How many numbers do you have ? If your set covers the range [0..(2^32)-1]
densely enough (you do the maths) then a 4GiB bitfield, where the n
-th bit represents the presence, or absence, of the natural number n
may be useful.
Upvotes: 2
Reputation: 6246
Your first step is good one to take because sorting reduces the differences to least. Here is a way to improve your algorithm:
Use of Huffman coding is more important then your step; I'll show you why:
consider the following data:
1 2 3 4 5 6 7 4294967295
where 4294967295 = 2^32-1. Using your algorithm:
1 1 1 1 1 1 1 4294967288
total bits needed is still 32*8
Using Huffman coding, the frequencies are:
1 => 7
4294967288 => 1
Huffman codes are 1 => 0
and 4294967288 => 1
Total bits needed = 7*1 + 1 = 8 bits
Huffman coding reduces size by 32*8/8 = 32 times
Upvotes: 4
Reputation: 12898
This problem is well known in database community as "Inverted index compression". You can google for some papers.
Following are some of the most common techniques:
VByte and Simple9/16 are easiest to implement, fast and have good compression ratio in practice.
Huffman coding is not very good for index compression because it is slow and differences are quite random in practice. (But it may be a good choice in your case.)
Upvotes: 4
Reputation: 596
If your numbers are not uniformly distributed, a better compression will be achieved by using frequencies of the numbers and affect less bits to most frequent ones. This is the idea behind huffman coding.
Upvotes: 0