m2j

Reputation: 155

Compression algorithm for contiguous numbers

I'm looking for an efficient encoding for storing simulated coefficients.

The data consists of thousands of curves, each holding 512 contiguous single-precision numbers. The data may be stored as fixed point, but it should preserve about 23 bits of precision (relative to unity level).

The curves could look like these:

[figure: explanatory example curves]

My best approach so far was to convert the numbers to 24-bit fixed point, then repeatedly take adjacent differences as long as the sum of squares decreases. Compressing the resulting data with LZMA (xz, lzip) gives about 7.5x compression (compared to float32).
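For reference, that pipeline can be sketched in Python (a sketch under assumptions: curves are NumPy arrays scaled relative to unity, and a real codec would also have to store the differencing order and the leading samples dropped by each pass in order to reconstruct exactly):

```python
import lzma

import numpy as np


def compress_curve(curve, bits=24):
    """Quantize to fixed point, take adjacent differences while the
    sum of squares keeps shrinking, then LZMA-compress the result."""
    # Quantize to 24-bit fixed point relative to unity level.
    scale = 1 << (bits - 1)
    q = np.round(np.asarray(curve, dtype=np.float64) * scale).astype(np.int64)
    order = 0
    while len(q) > 1:
        d = np.diff(q)
        if np.sum(d.astype(np.float64) ** 2) >= np.sum(q.astype(np.float64) ** 2):
            break
        q, order = d, order + 1
    # Note: `order` and the dropped leading samples must be kept
    # alongside the payload to invert the differencing.
    return order, lzma.compress(q.astype('<i4').tobytes())


curve = np.linspace(0.0, 0.5, 512)   # a smooth example curve
order, blob = compress_curve(curve)
ratio = (512 * 4) / len(blob)        # compression vs. raw float32
```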

Adjacent differences work well at first, but each pass amplifies the quantization noise.

I've also tried the cosine transform after subtracting the slope/curvature at the boundaries. The resulting compression was much weaker.

I tried AEC, but LZMA compressed much better. The highest compression came from bzip3 (after adjacent differences).

I have found no function that fits the data with high precision using a limited number of parameters.

Is there a way to reduce the penalty of quantization noise when using adjacent differences?

Are there encodings which are better suited for this type of data?

Upvotes: 1

Views: 91

Answers (1)

Mark Adler

Reputation: 112642

You could try a higher-order predictor. Your "adjacent difference" is a zeroth-order predictor, where the next sample is predicted to be equal to the last sample. You take the differences between the actuals and the predictions, and then compress those differences.

You can try first, second, etc. order predictors. A first-order predictor would look at the last two samples, draw a line between those, and predict that the next sample will fall on the line. A second-order predictor would look at the last three samples, fit those to a parabola, and predict that the next sample will fall on the parabola. And so on.

Assuming that your samples are equally spaced on your x-axis, then the predictors for x[0] up through cubics are:

  1. x[-1] (what you're using now)
  2. 2*x[-1] - x[-2]
  3. 3*x[-1] - 3*x[-2] + x[-3]
  4. 4*x[-1] - 6*x[-2] + 4*x[-3] - x[-4]

(Note that the coefficients are alternating-sign binomial coefficients.)
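Because the coefficients are binomial, the residual of the order-k predictor is exactly the (k+1)-fold adjacent difference, so all of these predictors are easy to try at once, e.g. with NumPy (a sketch; using residual energy to pick the order is an assumption):

```python
import numpy as np


def predictor_residuals(x, order):
    """Residuals of the order-`order` polynomial predictor. With the
    binomial coefficients above, these are just the (order+1)-th
    adjacent differences of the sample sequence."""
    return np.diff(np.asarray(x), n=order + 1)


# A uniformly sampled parabola is predicted perfectly at second order:
x = np.arange(512, dtype=np.int64) ** 2
assert np.all(predictor_residuals(x, 2) == 0)

# Pick the order with the least residual energy for a given curve.
curve = np.sin(np.linspace(0.0, 3.0, 512))
best = min(range(4), key=lambda k: np.sum(predictor_residuals(curve, k) ** 2))
```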

I doubt that the cubic polynomial predictor will be useful for you, but experiment with all of them to see if any help.

Assuming that the differences are small, you should use a variable-length integer to represent them. The idea would be to use one byte for each difference most of the time. For example, you could code seven bits of difference, say -64 to 63, in one byte with the high bit clear. If the difference doesn't fit in that, then make the high bit set, and have a second byte with another seven bits for a total of 14 with that second high bit clear. And so on for larger differences.
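One concrete realization of this scheme is signed LEB128 (used e.g. in the DWARF debug format): seven payload bits per byte with the high bit as a continuation flag, so differences in -64..63 cost a single byte. A Python sketch (the least-significant-first byte order and sign handling are one of several valid conventions):

```python
def varint_encode(d):
    """Encode a signed integer, 7 bits per byte, least significant
    first; a set high bit means another byte follows (signed LEB128)."""
    out = bytearray()
    while True:
        byte = d & 0x7F
        d >>= 7  # arithmetic shift: Python ints preserve the sign
        done = (d == 0 and not byte & 0x40) or (d == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def varint_decode(buf, pos=0):
    """Decode one value from `buf`; returns (value, next position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            if b & 0x40:          # sign-extend the final byte
                result -= 1 << shift
            return result, pos


# Differences in -64..63 fit in one byte; larger ones spill over.
assert len(varint_encode(63)) == 1 and len(varint_encode(-64)) == 1
assert len(varint_encode(64)) == 2
```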

Upvotes: 2
