Jakube

Reputation: 3575

Compress array of numbers

I have a large array (~400,000,000 entries) of integers from {0, 1, ..., 8}, so I need 4 bits per entry, which is around 200 MB.

At the moment I use a byte array and store two numbers in each byte.
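Roughly, that layout looks like this (a minimal sketch; class and method names are just placeholders):

    // Minimal sketch of the current scheme: two 4-bit values per byte.
    // Class and method names are placeholders.
    class NibbleArray {
        private final byte[] data;

        NibbleArray(int size) {
            data = new byte[(size + 1) / 2];
        }

        int getValue(int idx) {
            int b = data[idx >>> 1] & 0xFF;
            return (idx & 1) == 0 ? b & 0x0F : b >>> 4;   // low nibble for even idx, high for odd
        }

        void setValue(int idx, int value) {
            int i = idx >>> 1;
            if ((idx & 1) == 0) {
                data[i] = (byte) ((data[i] & 0xF0) | value);
            } else {
                data[i] = (byte) ((data[i] & 0x0F) | (value << 4));
            }
        }
    }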

I wonder if there is a good method to compress this array. I did some quick research and found algorithms like Huffman or LZW, but those algorithms are all about compressing the data, sending the compressed data to someone, and decompressing it on the other end.

I just want a table that takes less memory, so I can load it into RAM. The 200 MB table easily fits, but I'm thinking of even bigger tables.

It is important that I can still determine the value at any given position.

Any tips?


Further information: I just did a little experimenting, and it turns out that on average 2.14 consecutive numbers have the same value. There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights. So more than half of the numbers are 6s.

I only need a fast getValue(idx) function; setValue(idx, value) is not important.

Upvotes: 4

Views: 9957

Answers (6)

RobertB.

Reputation: 51

There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights

Total = 429981696 symbols

Assuming the values are randomly distributed (no exploitable structure), the entropy lower bound says you cannot do better than 618297161.7 bits ≈ 73.707 MB, or 1.438 bits per symbol on average.

Minimum number of bits is SUM(count[i] * LOG(429981696 / count[i], 2)).
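A quick sketch of that computation, using the counts from the question:

    // Computes SUM(count[i] * log2(total / count[i])) for the counts from the question.
    public class EntropyBound {
        public static void main(String[] args) {
            long[] counts = {1, 154, 10373, 385990, 8146188,
                             85008968, 265638366, 70791576, 80};
            long total = 0;
            for (long c : counts) total += c;                  // 429981696

            double bits = 0;
            for (long c : counts) {
                bits += c * (Math.log((double) total / c) / Math.log(2));
            }
            System.out.printf("%.1f bits = %.3f MB = %.3f bits/symbol%n",
                    bits, bits / 8 / 1024 / 1024, bits / total);
        }
    }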

You can achieve this size using a range coder.

Given Sqrt(N) = 20736

You can achieve O(Sqrt(N)) complexity for accessing a random element by saving the decoder state after every SQRT(N) decoded symbols, i.e. an array of CEIL(SQRT(N)) saved states. Each saved state allows fast decoding of the next block of 20736 symbols.

The complexity of accessing an element drops to O(1) if you access the memory stream in a linear way.

Additional memory used: 20736 * 4 bytes ≈ 81 KB.
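A rough sketch of the checkpoint idea; the RangeDecoder interface and its methods are hypothetical placeholders for whatever range/arithmetic coder is actually used:

    // Sketch only: RangeDecoder and its methods are hypothetical placeholders.
    interface RangeDecoder {
        Object saveState();              // snapshot of the decoder state
        void restoreState(Object state);
        int decodeSymbol();              // next symbol in the stream
    }

    class CheckpointedStream {
        static final int BLOCK = 20736;  // SQRT(N)
        private final Object[] checkpoints;
        private final RangeDecoder decoder;

        CheckpointedStream(RangeDecoder decoder, int n) {
            this.decoder = decoder;
            this.checkpoints = new Object[(n + BLOCK - 1) / BLOCK];
            for (int i = 0; i < n; i++) {
                if (i % BLOCK == 0) checkpoints[i / BLOCK] = decoder.saveState();
                decoder.decodeSymbol();
            }
        }

        int getValue(int idx) {
            decoder.restoreState(checkpoints[idx / BLOCK]);   // jump to the start of the block
            int symbol = -1;
            for (int i = 0; i <= idx % BLOCK; i++) symbol = decoder.decodeSymbol();
            return symbol;                                    // at most BLOCK decodes = O(SQRT(N))
        }
    }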

Upvotes: 2

Ebbe M. Pedersen

Reputation: 7518

As more than half of the entries are sixes, just encode those as a single bit. Use 2 bits for the second most common value, and so on. You then have something like this:

                        encoding               total
         # entries      bit pattern  #bits    # of bits
 zero            1      000000001      9              9
 ones          154      0000001        7           1078  
 twos        10373      000001         6          62238
 threes     385990      00001          5        1929950
 fours     8146188      0001           4       32584752
 fives    85008968      01             2      170017936
 sixes   265638366      1              1      265638366
 sevens   70791576      001            3      212374728
 eights         80      00000001       8            640
--------------------------------------------------------
 Total                                        682609697 bits

With 429981696 entries encoded in 682609697 bits, you would need 1.59 bits per entry on average.
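In code, this encoding is simply a run of zeros followed by a terminating 1 bit, where the number of zeros is assigned by frequency. A minimal encoder sketch (using a BitSet as the bit sink; the names are made up):

    import java.util.BitSet;

    // Sketch of the table above: each value becomes a run of zeros followed by
    // a single 1 bit; the number of zeros is assigned by frequency
    // (6 -> "1", 5 -> "01", 7 -> "001", ..., 0 -> "000000001").
    class UnaryishEncoder {
        // zeros to write, indexed by value 0..8
        static final int[] ZEROS = {8, 6, 5, 4, 3, 1, 0, 2, 7};

        static BitSet encode(byte[] values) {
            BitSet bits = new BitSet();
            int pos = 0;
            for (byte v : values) {
                pos += ZEROS[v];   // zeros are the BitSet default, so just skip ahead
                bits.set(pos++);   // terminating 1 bit
            }
            return bits;           // total length is pos bits (682609697 for the counts above)
        }
    }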

Edit:

To allow for fast lookup, you can build an index into the compressed data that records where every n-th entry starts. Finding an exact value then means decompressing n/2 entries on average. Depending on how fast it needs to be, you can adjust the number of entries in the index. To reduce the size of a pointer into the compressed data (and thus the size of the index), store only the offset from an estimated position instead of the absolute position.

                                Estimated pos   Offset from
# entry no   Actual Position     (n * 1.59)      estimated
     0             0                  0               0      
   100           162                159               3      Use this 
   200           332                318              14  <-- column as   
   300           471                477              -6      the index
   400           642                636               6
   500           807                795              12
   600           943                954             -11

The overhead of such an index, with an index entry every 100 values and 10 bits per offset, is about 0.1 bits extra per entry.
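A getValue sketch combining the prefix code above with the offset index; the field names, the 1.59-bit estimate and the use of a short[] for the offsets are illustrative assumptions:

    import java.util.BitSet;

    // Sketch: look up entry idx using one index entry per STEP values.
    // Field names and the use of a short[] for the offsets are assumptions;
    // the answer suggests packing the offsets into 10 bits each.
    class IndexedLookup {
        static final int STEP = 100;
        // value for a given count of leading zeros: 0 -> 6, 1 -> 5, 2 -> 7, ...
        static final int[] VALUE_FOR_ZEROS = {6, 5, 7, 4, 3, 2, 1, 8, 0};

        private final BitSet bits;      // compressed data, as produced above
        private final short[] offsets;  // signed offset from the estimated bit position

        IndexedLookup(BitSet bits, short[] offsets) {
            this.bits = bits;
            this.offsets = offsets;
        }

        int getValue(int idx) {
            int block = idx / STEP;
            // estimated position = entries so far * 1.59 bits, corrected by the stored offset
            int pos = (int) (block * STEP * 1.59) + offsets[block];
            int value = -1;
            for (int i = 0; i <= idx % STEP; i++) {   // decode at most STEP symbols
                int zeros = 0;
                while (!bits.get(pos++)) zeros++;     // count zeros up to the terminating 1 bit
                value = VALUE_FOR_ZEROS[zeros];
            }
            return value;
        }
    }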

Upvotes: 2

maaartinus

Reputation: 46482

It depends on what your data looks like. Are there repeating entries, or do they change only slowly, or what?

If so, you can try to compress chunks of your data and decompress them when needed. The bigger the chunks, the more memory you can save and the worse the speed. IMHO it's no good deal. You could also store the data compressed and decompress it in memory.

Otherwise, i.e., in case of no regularities, you'll need at least log(9) / log(2) = 3.17 bits per entry and there's nothing that could improve it.

You can come pretty close to this value by packing 5 numbers into a short. As 9**5 = 59049 < 65536 = 2**16, it fits nearly perfectly. You'd need 3.2 bits per number, no big win. Packing five numbers is done via this formula

a + 9 * (b + 9 * (c + 9 * (d + 9 * e)))

and unpacking is trivial via a precomputed table.
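A sketch of that packing, with the precomputed unpack table; note that packed values above 32767 wrap to negative shorts, so they get masked on the way out:

    // Sketch: pack five base-9 digits into one short, unpack via a precomputed table.
    class FivePack {
        private static final byte[][] UNPACK = new byte[59049][5];   // 9**5 entries
        static {
            for (int packed = 0; packed < 59049; packed++) {
                int x = packed;
                for (int i = 0; i < 5; i++) {
                    UNPACK[packed][i] = (byte) (x % 9);
                    x /= 9;
                }
            }
        }

        static short pack(int a, int b, int c, int d, int e) {
            // values above 32767 wrap to negative shorts; unpack masks them back
            return (short) (a + 9 * (b + 9 * (c + 9 * (d + 9 * e))));
        }

        static int unpack(short packed, int i) {   // i in 0..4; i == 0 returns a
            return UNPACK[packed & 0xFFFF][i];
        }
    }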

UPDATE after question update

Further information: I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value. There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights. So more than half of the numbers are 6s.

The fact that on average about 2.14 consecutive numbers are the same could lead to some compression, but in this case it tells us nothing. There are nearly only fives and sixes, so encountering two equal consecutive numbers seems to be implied.

Given these facts, you can forget my above optimization. There are practically only 8 values, as you can treat the single zero separately. So you need just 3 bits per value and a single index for the zero.

You can even create a HashMap for all values below four or above seven, store those 1+154+10373+385990+80 entries in it, and use only 2 bits per value. But this is still far from ideal.

Assuming no regularities, you'd need 1.44 bits per value, as this is the entropy. You could go over all 5-tuples, compute their histogram, and use 1 byte to encode each of the 255 most frequent tuples. All the remaining tuples would map to the 256th value, telling you that you have to look up the rare tuple's value in a HashMap.
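A sketch of that scheme; the class and field names are made up, and a plain HashMap stands in for a more compact primitive map:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: one byte per 5-tuple. Codes 0..254 stand for the 255 most frequent
    // tuples; code 255 means "look the tuple up in the rare map". All names are
    // made up; a plain HashMap stands in for a more compact primitive map.
    class TupleCodec {
        private final byte[] codes;                    // one byte per 5-tuple
        private final short[] frequentTuples;          // packed tuple for codes 0..254
        private final Map<Integer, Short> rareTuples;  // tuple index -> packed tuple

        TupleCodec(byte[] codes, short[] frequentTuples, Map<Integer, Short> rareTuples) {
            this.codes = codes;
            this.frequentTuples = frequentTuples;
            this.rareTuples = rareTuples;
        }

        int getValue(int idx) {
            int tupleIdx = idx / 5;
            int code = codes[tupleIdx] & 0xFF;
            int packed = (code < 255)
                    ? frequentTuples[code] & 0xFFFF
                    : rareTuples.get(tupleIdx) & 0xFFFF;    // rare tuple, stored separately
            for (int i = 0; i < idx % 5; i++) packed /= 9;  // drop the lower base-9 digits
            return packed % 9;
        }
    }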

Some evaluation

I was curious whether it could work. Packing 5 numbers into one byte needs 85996340 bytes. There are nearly 5 million tuples which don't fit, so my idea was to use a hash map for them. Assuming rehashing (open addressing) rather than chaining, it makes sense to keep it maybe 50% full, so we need 10 million entries. Assuming a TIntShortHashMap (mapping indexes to tuples), each entry takes 6 bytes, leading to 60 MB. Too bad.

Packing only 4 numbers into one byte consumes 107495425 bytes and leaves 159531 tuples which don't fit. This looks better; however, I'm sure the denser packing could be improved a lot.

The results as produced by this little program:

*** Packing 5 numbers in a byte. ***
Normal packed size: 85996340.
Number of tuples in need of special handling: 4813535.

*** Packing 4 numbers in a byte. ***
Normal packed size: 107495425.
Number of tuples in need of special handling: 159531.

Upvotes: 5

OldCurmudgeon

Reputation: 65859

There are many options - most depend on how your data looks. You could use any of the following and even combinations of them.

LZW - or variants

In your case a variant that uses a 4-bit initial dictionary would probably be a good start.

You could compress your data in blocks so you could use the index requested to determine which block to decode on the fly.

This would be a good fit if there are repeating patterns in your data.
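A sketch of the per-block idea; since the JDK has no built-in LZW, DEFLATE from java.util.zip stands in for whichever codec you pick, the block size is arbitrary, and values are stored one per byte for simplicity:

    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    // Sketch: compress fixed-size blocks independently so a single block can be
    // decompressed on demand. DEFLATE stands in for LZW; values are one per byte.
    class BlockStore {
        static final int BLOCK = 1 << 16;           // values per block, arbitrary
        private final byte[][] blocks;              // one compressed blob per block

        BlockStore(byte[] values) {
            blocks = new byte[(values.length + BLOCK - 1) / BLOCK][];
            for (int b = 0; b < blocks.length; b++) {
                int from = b * BLOCK;
                int len = Math.min(BLOCK, values.length - from);
                Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
                deflater.setInput(values, from, len);
                deflater.finish();
                byte[] buf = new byte[len + 64];    // large enough even for incompressible input
                int written = deflater.deflate(buf);
                deflater.end();
                blocks[b] = Arrays.copyOf(buf, written);
            }
        }

        int getValue(int idx) throws DataFormatException {
            Inflater inflater = new Inflater();
            inflater.setInput(blocks[idx / BLOCK]);
            byte[] out = new byte[BLOCK];
            inflater.inflate(out);                  // decompress only this block
            inflater.end();
            return out[idx % BLOCK];
        }
    }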

Difference Coding

Your edit suggests that your data may benefit from a differencing pass. Essentially you replace every value with the difference between it and its predecessor.

Again you would need to treat your data in blocks and difference fixed run lengths.

You may also find that differencing followed by LZW would be a good solution.
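A minimal sketch of such a per-block differencing pass (the block size is arbitrary):

    // Sketch: replace each value by the difference to its predecessor, restarting
    // at every block boundary so blocks stay independently decodable.
    class DifferenceCoder {
        static final int BLOCK = 4096;   // arbitrary block size

        static byte[] encode(byte[] values) {
            byte[] out = new byte[values.length];
            for (int i = 0; i < values.length; i++) {
                out[i] = (i % BLOCK == 0) ? values[i] : (byte) (values[i] - values[i - 1]);
            }
            return out;
        }

        static byte[] decode(byte[] diffs) {
            byte[] out = new byte[diffs.length];
            for (int i = 0; i < diffs.length; i++) {
                out[i] = (i % BLOCK == 0) ? diffs[i] : (byte) (out[i - 1] + diffs[i]);
            }
            return out;
        }
    }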

Fourier Transform

If some data loss would be acceptable then some of the Fourier Transform compression schemes may be effective.

Lossless JPEG

If your data has a 2-dimensional aspect then some of the JPEG algorithms may lend themselves well to it.

The bottom line

You need to bear in mind:

  1. The longer you spend compressing - up to a limit - the better the compression ratio you can achieve.
  2. There is a real practical limit to how far you can go with lossless compression.
  3. Once you go lossy you are essentially no longer restricted. You could approximate the whole of your data with new int[]{6} and get quite a few correct results.

Upvotes: 2

ganaraj

Reputation: 420

How about considering a caching solution like mapdb or apache jcs? These let you persist the collection to disk, enabling you to work with very large lists.

Upvotes: 1

skiwi

Reputation: 69379

You should look into a BitSet to store it efficiently. Contrary to what the name suggests, it is not exactly a set: it has order and you can access it by index.

Internally it uses an array of longs to store the bits, and hence it can update itself using bit masks.

I don't believe you can store it any more efficiently natively; if you want even more efficiency, you should consider packing/compression algorithms.
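For example, a sketch storing 4 bits per value in a BitSet; same footprint as a packed byte array, but the bit handling is delegated to BitSet (names are placeholders):

    import java.util.BitSet;

    // Sketch: 4 bits per value in a BitSet. Same footprint as a packed byte array,
    // but the bit handling is delegated to BitSet. Names are placeholders.
    class BitSetTable {
        private final BitSet bits;

        BitSetTable(int size) {
            bits = new BitSet(size * 4);
        }

        void setValue(int idx, int value) {
            for (int b = 0; b < 4; b++) {
                bits.set(idx * 4 + b, ((value >>> b) & 1) != 0);
            }
        }

        int getValue(int idx) {
            int value = 0;
            for (int b = 0; b < 4; b++) {
                if (bits.get(idx * 4 + b)) value |= 1 << b;
            }
            return value;
        }
    }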

Upvotes: 0
