darthS

Reputation: 31

How to optimally compress inverted indexes for time series dataset

I am attempting to compress a time series dataset down to a 25% compression ratio (compressed size over original size). This has turned into a vendetta for me.

The data is 1-minute-interval historical stock quotes over a 1-month period (see notes for the dataset) with no missing data. That comes to about 9000 data points of type uint32_t (I'm not storing decimals).

My first attempt was to run FastPFor compression on all the data. This resulted in a ~80% compression ratio. Not good enough. For reference, that attempt looked roughly like the sketch below, adapted from the lemire/FastPFor README (the factory API differs between library versions, so treat the codec name and factory calls as approximate).
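```cpp
// Sketch of the first attempt: FastPFor over the raw quote values.
// Adapted from the lemire/FastPFor README; the factory API may differ
// between versions of the library.
#include "codecfactory.h"  // from https://github.com/lemire/FastPFor
#include <cstdint>
#include <vector>

std::vector<uint32_t> compress_raw(const std::vector<uint32_t> &quotes) {
    FastPForLib::CODECFactory factory;
    FastPForLib::IntegerCODEC &codec = *factory.getFromName("simdfastpfor256");

    // nvalue is in/out: buffer capacity going in, compressed length
    // (in 32-bit words) coming out.
    std::vector<uint32_t> out(quotes.size() + 1024);
    size_t nvalue = out.size();
    codec.encodeArray(quotes.data(), quotes.size(), out.data(), nvalue);
    out.resize(nvalue);
    return out;
}
```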

So, next approach. I first got rid of all the timestamps (obvious, since the interval is a fixed 1 minute).

I then sorted the historical data and removed duplicates, which reduced the number of unique values from ~5000 to ~1000. From there, I used a differential (delta) SIMD compression algorithm, which also does bit packing, to compress it further. This resulted in a final ~5% compression ratio. Roughly, that step looks like this (a sketch; the function name is mine):
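```cpp
// Sketch of the sort/dedupe/delta step: sorted unique values have small
// non-negative gaps, which bit-pack tightly.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> unique_deltas(std::vector<uint32_t> quotes) {
    std::sort(quotes.begin(), quotes.end());
    quotes.erase(std::unique(quotes.begin(), quotes.end()), quotes.end());

    std::vector<uint32_t> deltas;
    deltas.reserve(quotes.size());
    uint32_t prev = 0;
    for (uint32_t v : quotes) {
        deltas.push_back(v - prev);  // first delta is the base value itself
        prev = v;
    }
    return deltas;  // feed this to a bit-packing codec such as FastPFor
}
```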

Great! But here comes the problem: to reconstruct the dataset, you have to be able to put it back in order. My idea was to build an inverted index over the processed values above, where each entry maps a value to its original positions. Of course, this just adds the 9000 position numbers back in, which put the total close to the original size.

Example:

Values    Indexes
10   ===>  40, 20, 55, 100, 56, 21
25   ===>  1, 5
...
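Roughly, that index is built like this (a sketch; `build_inverted_index` is just an illustrative name):

```cpp
// One postings list of original positions per unique value.
#include <cstdint>
#include <map>
#include <vector>

std::map<uint32_t, std::vector<uint32_t>>
build_inverted_index(const std::vector<uint32_t> &quotes) {
    std::map<uint32_t, std::vector<uint32_t>> index;
    for (uint32_t pos = 0; pos < quotes.size(); ++pos)
        index[quotes[pos]].push_back(pos);  // value -> positions where it occurs
    return index;  // every position appears exactly once across all lists
}
```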

As a result, I attempted to compress the inverted indexes:

  1. Sort each index list
  2. Drop any value that is exactly +1 from the previous one, run-length style (see the gap-encoding sketch after this list)
  3. Compress each index list with Lemire's SIMDCompression library from GitHub (I also tried his FastPFor algorithm)
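Steps 1 and 2 boil down to gap encoding, roughly like the sketch below (my simplification: instead of literally dropping the +1 values, I leave the gaps of 1 in place for the RLE/bit-packing pass to collapse):

```cpp
// Sort a postings list and gap-encode it, so runs of consecutive positions
// become runs of gap == 1 that RLE or bit packing can collapse.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> gap_encode(std::vector<uint32_t> postings) {
    std::sort(postings.begin(), postings.end());
    uint32_t prev = 0;
    for (uint32_t &p : postings) {
        uint32_t gap = p - prev;  // always >= 1 after sorting (positions unique)
        prev = p;
        p = gap;
    }
    return postings;  // hand the gaps to SIMDCompression / FastPFor
}
```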

Unfortunately, this attempt to compress the indexes did not work out. After all that, it only achieved a ~75% compression ratio, with the actual encoding using 20-64 bits per integer (recall that the inputs are 32-bit numbers). The compression even made index lists containing only one number twice their original size; I expected those to stay the same, but block-based codecs presumably add per-list header overhead.

So the inverted-index attempts look futile: the result is comparable to the original size, which does not justify the extra processing.

That leaves me with some questions:

What is the best way to compress inverted indexes?

Is there a theoretical minimum compressed size for data like this?

Do you know of any methods I can use instead of this?

Any input is appreciated.

Example Data

Notes

Upvotes: 0

Views: 189

Answers (1)

Mark Adler

Reputation: 112374

The whole sorting thing seems pointless.

I took your series of 8000 (not 9000) values, took the differences, and wrote those as variable-length integers resulting in about 14,000 bytes. I then compressed that with gzip to get about 6000 bytes. You didn't say what the starting point was for your 25%, but if it was the four-byte binary integers (32,000 bytes), then that approach reduces it to less than 20%.
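In code, that pipeline is roughly the following sketch (zlib's raw `compress()` stands in for the gzip wrapper; the helper names are illustrative, and error handling is omitted):

```cpp
#include <cstdint>
#include <vector>
#include <zlib.h>

// Zigzag-map a signed delta to unsigned so small negative moves stay small.
static uint32_t zigzag(int64_t d) {
    return static_cast<uint32_t>((static_cast<uint64_t>(d) << 1) ^
                                 static_cast<uint64_t>(d >> 63));
}

std::vector<unsigned char> delta_varint_deflate(const std::vector<uint32_t> &q) {
    // 1. Delta-encode and emit LEB128-style varints: quotes move little
    //    minute to minute, so most deltas fit in one or two bytes.
    std::vector<unsigned char> varints;
    int64_t prev = 0;
    for (uint32_t v : q) {
        uint32_t u = zigzag(static_cast<int64_t>(v) - prev);
        prev = v;
        while (u >= 0x80) {
            varints.push_back(static_cast<unsigned char>((u & 0x7F) | 0x80));
            u >>= 7;
        }
        varints.push_back(static_cast<unsigned char>(u));
    }

    // 2. Deflate the varint stream (a real .gz file would use deflateInit2
    //    with windowBits 15+16 instead of compress()).
    uLongf destLen = compressBound(varints.size());
    std::vector<unsigned char> out(destLen);
    compress(out.data(), &destLen, varints.data(), varints.size());
    out.resize(destLen);
    return out;
}
```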

Upvotes: 0
