Reputation: 4556
In Java, at some point in my program I have to process gigabytes of int[] arrays in memory. They are sorted and contain only natural numbers (1, 2, 3, 4, ..., up to n) which represent file lines. The number n is the number of lines in a file and can be at most 100000. So each array is simply a subset of the set of all lines in a file. As you can calculate, there are millions of such subsets, and some of them can weigh a lot. As for the data distribution inside those subsets (let us call them arrays now), it is completely random: a long array may contain 50000 numbers and a small one only, say, 1500; and each array contains unpredictable sequences such as [3, 10, 11, 12, 13, 14, 15, 135, 136, ...] or [2, 3, 746, 7889, 7892, 80000, ...].
Since I have lots of arrays to compress and decompress, I would like to find the fastest solution in terms of time spent per operation; overhead should be as small as possible.
What library would you recommend?
Upvotes: 4
Views: 2379
Reputation: 112492
You can preprocess the data losslessly to improve the compression. Leave the first value as is. Make each subsequent value the difference between it and the previous value, minus one. Since the arrays are strictly increasing, such differences are guaranteed to be non-negative. Now code each integer as a variable-length sequence of bytes: a value in 0..127 fits in a single byte. If the high bit of a byte is set (128..255), take its low seven bits as the next seven bits of the integer and read the next byte. Continue until you reach a byte whose high bit is zero, which signifies the end of the integer.
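A minimal Java sketch of this idea (the class and method names are mine, and the variable-length form used here is the common one with seven payload bits per byte and the high bit as a continuation flag):

import java.io.ByteArrayOutputStream;

public class DeltaVarInt {

    // Encode a sorted array of distinct line numbers: store delta-minus-one per value,
    // each written as a variable-length integer (low 7 bits per byte, high bit = "more bytes follow").
    static byte[] encode(int[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = -1;                      // makes the first stored value the number itself
        for (int v : sorted) {
            int d = v - prev - 1;           // non-negative because the array is strictly increasing
            while (d >= 0x80) {
                out.write((d & 0x7F) | 0x80);
                d >>>= 7;
            }
            out.write(d);
            prev = v;
        }
        return out.toByteArray();
    }

    // Decode; the caller passes how many values were encoded.
    static int[] decode(byte[] data, int count) {
        int[] result = new int[count];
        int prev = -1, pos = 0;
        for (int i = 0; i < count; i++) {
            int d = 0, shift = 0, b;
            do {
                b = data[pos++] & 0xFF;
                d |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev = prev + d + 1;
            result[i] = prev;
        }
        return result;
    }
}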
Now you have encoded the integers as a sequence of bytes, potentially quite a bit shorter than storing each original integer as, say, four or eight bytes. Furthermore, you can now apply any standard compression technique that works on a series of bytes and expect some additional gain. E.g. if runs of consecutive line numbers are common, the differences produce strings of zero bytes, which are highly compressible.
For the greatest speed in compression and decompression while sacrificing the degree of compression, look at lz4. If you don't need something that fast, look at zlib, where you can select the compression speed and effectiveness with the compression level.
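For zlib you don't even need an extra library in Java: java.util.zip.Deflater lets you pick the compression level directly (for lz4 you would need a third-party binding such as lz4-java, not shown here). A sketch using the fastest level:

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FastZlib {

    // Compress the varint byte stream with the fastest zlib setting (level 1).
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Assumes a complete, well-formed compressed stream.
    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 4);
        byte[] buf = new byte[8192];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }
}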
For your examples, random choices of 1500 out of 100000 result in about 1720 bytes uncompressed, 1600 bytes compressed. Random choices of 50000 out of 100000 result in 50000 bytes uncompressed, 18600 bytes compressed. The compressions were done with the fastest zlib compression, level 1.
Note that in the latter case, where half of the line numbers are used, it would be more efficient to use a bit array, which would be 12500 bytes uncompressed. In that case the data cannot be compressed further, since the bitmap appears random (half the bits set, half not). With somewhat fewer or more line numbers, e.g. 25000 or 75000, the bitmap becomes compressible, in both cases to about 10500 bytes.
Compressed bitmaps are smaller for roughly 12500 line numbers and above, whereas the compressed differenced variable-length integers are smaller below that. The cutoff is the point at which the two approaches have about the same uncompressed size of 12500 bytes.
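For the bit-array variant, java.util.BitSet gives that 12500-byte representation almost for free; a sketch (the 100000 upper bound is taken from the question, and the class name is mine):

import java.util.BitSet;

public class LineBitmap {

    // One bit per possible line number: 100000 bits is about 12500 bytes, regardless of how many are set.
    static byte[] toBitmap(int[] sortedLines) {
        BitSet bits = new BitSet(100000);
        for (int line : sortedLines) {
            bits.set(line);
        }
        return bits.toByteArray();          // length depends on the highest set bit
    }

    static int[] fromBitmap(byte[] bytes) {
        return BitSet.valueOf(bytes).stream().toArray();   // indices of set bits, ascending
    }
}

Following the numbers above, a simple rule of thumb would be to pick the bitmap for subsets of more than roughly 12500 lines and the differenced varints otherwise.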
Upvotes: 3
Reputation: 4684
Maybe this can help you as well: Compressing array of integers in java
Do you have to do a lot of calculations on the arrays, or are they read-only?
Edit:
// If space is more important than performance, this might work.
// Note: this might be totally stupid for some cases.
// The first element should be false, since it's the 0 ;)
boolean[] numbers = { false, true, true, true, false, false, true };
for (int i = 0; i < numbers.length; i++) {
    if (numbers[i]) {
        // or do some calculations on/with a copy of i
        System.out.println(i);
    }
}
Since a boolean array uses 1 byte to store each entry (plus overhead), this would mean, with a maximum of 100'000 entries: 100'000 bytes = ~97 KB for each array.
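A sketch of how such a boolean[] could be built from one of the sorted int[] arrays (assuming n is the number of lines, as in the question; the method name is mine):

// Index = line number; index 0 stays false because line numbers start at 1.
static boolean[] toMembership(int[] sortedLines, int n) {
    boolean[] numbers = new boolean[n + 1];
    for (int line : sortedLines) {
        numbers[line] = true;
    }
    return numbers;
}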
Upvotes: 0