LC Nielsen
LC Nielsen

Reputation: 63

Using Python to compress data into e.g. 12-bit chunks?

I have a type of data that is output as ~28 million integers ranging from 0 to 4095 (It technically comes of the hardware out as a signed 16-bit integers ranging from 0 to (1/2) * 2^16, but this representation is needlessly precise). In principle therefore each datapoint's value can be represented by 12 bits - a byte and a nybble, if you will. I'm dealing with, in the long term, moderately large volumes of this data (Terabytes in the double digits) which I intend to store as binaries, so obviously losslessly compressing it to 75% of its size would be welcome.

Obviously I could just write a function that encodes my data into booleans and back and use Numpy's binary handling functions to parse. However I have to balance this against ease/speed of storage and retrieval. Therefore I wonder if there's any existing package, algorithm, etc which accomplishes this in a simple and efficient way. I can work with Fortran or C if I need to, so it's an option to make a module in those, but my colleagues would prefer if I didn't.

Upvotes: 2

Views: 509

Answers (2)

Alex Reynolds
Alex Reynolds

Reputation: 96937

Could you package/bit-shift two 12-bit integers into an array of three bytes (24 bits), and then use bit shifting to get the upper and lower 12 bits?

I imagine such an encoding would also compress well, on top of the space savings from encoding, given the redundancy, or if your data are particularly sparse or integers are distributed in a certain way.

I don't know a ton about numpy but from a cursory look, I believe it can store arrays of bytes, and there are bit shifting operands available in Python. If performance is a requirement, one could look into Cython for C-based bit operations on an unsigned char * within Python.

You'd need to figure out offsets, so that you always get on the correct starting byte, or you'd get something like a "frameshift mutation", to use a biological metaphor. That could be bad news for TB-sized data containers.

Upvotes: 2

Samwise
Samwise

Reputation: 71454

Have you tried gzip.compress()? https://docs.python.org/3/library/gzip.html

It's not specialized for the particular task you're doing, but if you have something as simple as a series of bytes where a subset of the bits are always zeroes, I'd expect gzip to handle that about as well as a specialized compression algorithm would, and it has the benefit of being a universal format that other tools will have no problem reading.

Upvotes: 3

Related Questions