sandymac
sandymac

Reputation: 21

Ways to assit the compression of large custom data files

I'm seeking advice on how to better assist compression tools get better lossless compression.

I have many large files (>100meg) containing sensor readings from a variety of sensors. The samples from various sensors are of different bit sizes (16 bit, 24 bit, 32 bit) and different frequencies (70Hz to 250Hz). With the common compressors I'm aware of (zip, gzip, bzip2) I can get a compressed file about 70% of the original file size. It seems to me if I could tell the compression tool these bytes are this type of sample and those bytes are another sample type there may be compression gains to be had but I'm not aware of anything that would let me do this.

Upvotes: 1

Views: 272

Answers (1)

Mark Adler
Mark Adler

Reputation: 112284

Step 0 would be to code the data in binary. (16 bits in two bytes, 24 bits in three bytes, etc.) I hope that you're already doing that.

Step 1 would be to use differences. From your description, I bet that successive values don't change much. Therefore differences will be small and have many leading zero bits. Try that, and then a general-purpose compressor.

Step 2 would be to use variable-length integer coding. The high bit of each byte determines the span of each integer. The first byte of an integer always has a high bit of zero. All subsequent bytes of the same integer have a high bit of one. Build the integer out of the low seven bits of each byte. (I take the first byte to have the least significant bits, but you could do it most-significant bit order as well.) This will code your small differences in one byte. Also this coding will handle any number of bits in the samples, which is convenient in your application. Try this, and then a general-purpose compressor.

Step 3 might be more detailed analysis of the waveforms for a better predictor. Step 1 simply uses the last value as the predictor. You could have a more complex function of the previous n values as the predictor for the next value. Whether this would help is highly dependent on your data.

Upvotes: 1

Related Questions