Reputation: 620
I have a numpy array of type int8 and shape (100,100). I have used Huffman coding to find a maximally efficient way to encode its contents. For my particular use case, the values I am trying to save are roughly normally distributed around 0 with a standard deviation of about 5, but this question applies to optimally saving ndarrays with any distribution. In my case, extreme values of up to -20 or 20 are observable in rare cases. Clearly it is more space-efficient to Huffman code this type of array than to use standard 8-bit integers. My question is how to do this.
I tried using np.save() and pickle but was not able to get the efficiency I want. Specifically, for an array of shape (100,100) I get a file of size 10,128 bytes using np.save(), which makes sense with 1 byte per integer plus overhead. Using pickle, I get a file of size 10,158 bytes, about the same. However, based on my Huffman coding scheme, I should be able to encode the array contents of my particular test case (shown below) in 144 bytes (not including overhead)!
I tried mapping each integer in the array to its optimal byte string, so that I have an array of byte strings (type S12), and saving that, but I get a 118 kB file using both np.save() and pickle, so that obviously does not work.
Thanks for the help!
Code to reproduce my exact test case:
import pickle
import numpy as np
# Seed random number generator
np.random.seed(1234)
# Build random normal array and round it
test_array = np.random.normal(0, 5, size=(100, 100))
test_array = np.round(test_array).astype(np.int8)
# Set encoding dictionary
encoding_dict = {6: b'0000',
                 -8: b'00010',
                 8: b'00011',
                 5: b'0010',
                 -5: b'0011',
                 12: b'0100000',
                 -13: b'01000010',
                 14: b'010000110',
                 -15: b'0100001110',
                 -14: b'0100001111',
                 10: b'010001',
                 7: b'01001',
                 -4: b'0101',
                 4: b'0110',
                 -7: b'01110',
                 11: b'0111100',
                 -11: b'0111101',
                 -12: b'01111100',
                 13: b'011111010',
                 -19: b'011111011000',
                 -18: b'011111011001',
                 -16: b'01111101101',
                 -17: b'011111011100',
                 16: b'011111011101',
                 15: b'01111101111',
                 -10: b'0111111',
                 -3: b'1000',
                 -6: b'10010',
                 -9: b'100110',
                 9: b'100111',
                 3: b'1010',
                 -2: b'1011',
                 1: b'1100',
                 2: b'1101',
                 -1: b'1110',
                 0: b'1111'}
# Save using different methods
np.save('test.npy', test_array)
with open('test.pkl', 'wb') as file:
    pickle.dump(test_array, file)
# Try converting to bytes and then saving
bytes_array = np.array([encoding_dict[key] for key in test_array.flatten()]).reshape(test_array.shape)
np.save('test_bytes.npy', bytes_array)
with open('test_bytes.pkl', 'wb') as file:
    pickle.dump(bytes_array, file)
# See how many bytes it should take
tmp_flat = test_array.flatten()
tmp_bytes = np.zeros_like(tmp_flat)
for i in range(len(tmp_bytes)):
    tmp_bytes[i] = len(encoding_dict[tmp_flat[i]]) / 8
print(tmp_bytes.sum())
Upvotes: 1
Views: 1535
Reputation: 114310
Your error is here:
tmp_bytes = np.zeros_like(tmp_flat)
tmp_flat is an int8 array, so the statement tmp_bytes[i] = len(encoding_dict[tmp_flat[i]]) / 8 is truncating a lot of digits when it converts a float value to int. Replace the offending line with something like:
tmp_bytes = np.zeros(tmp_flat.shape, np.single)
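With that change the per-symbol byte counts are no longer truncated to integers, and re-running your counting loop (same names as in your snippet) should land around 5492 bytes rather than 144:
for i in range(len(tmp_bytes)):
    tmp_bytes[i] = len(encoding_dict[tmp_flat[i]]) / 8
print(tmp_bytes.sum())  # roughly 5492.25 here, i.e. about 4.4 bits per value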
But to show how to actually do the compression: I would recommend using np.packbits, which will actually create an array of 5493 bytes for you.
# Make a string of all the data
s = b''.join(encoding_dict[key] for key in test_array.ravel())
# Convert the string into an array
a = np.array(list(map(int, s.decode('ascii'))))
# pack it
result = np.packbits(a)
The statement a = ... is doing a whole lot of extra work, since it decodes the data, then copies it, then converts a string to an integer umpteen times, etc. Here is a longer, but much more efficient method:
s = bytearray(b'').join(encoding_dict[key] for key in test_array.ravel())
a = np.array(s)
a -= ord('0') # a now contains just 0 and 1
result = np.packbits(a)
When you stash this array, make sure that you include the number of bits you expected, rather than the number of bytes. You can unpack to a binary string using np.unpackbits, which supports a count parameter specifically for this purpose (my addition by the way).
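For instance, a rough sketch of the decode path, assuming the packed result was stored together with the bit count (nbits = len(s) from the snippet above; decoding_dict and restored are just illustrative names):
nbits = len(s)
bits = np.unpackbits(result, count=nbits)  # back to an array of 0s and 1s
bitstring = (bits + ord('0')).astype(np.uint8).tobytes().decode('ascii')
decoding_dict = {v.decode('ascii'): k for k, v in encoding_dict.items()}
decoded, code = [], ''
for ch in bitstring:  # greedy left-to-right walk works because the codes are prefix-free
    code += ch
    if code in decoding_dict:
        decoded.append(decoding_dict[code])
        code = ''
restored = np.array(decoded, dtype=np.int8).reshape(test_array.shape)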
Final point, use ravel instead of flatten when you can. The latter always makes a copy, while the former generally does not.
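A quick way to see the difference (the array here is just for illustration):
x = np.zeros((100, 100), dtype=np.int8)
print(np.shares_memory(x, x.ravel()))    # True: a view, no copy for a contiguous array
print(np.shares_memory(x, x.flatten()))  # False: flatten always copies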
Upvotes: 1
Reputation: 112374
There is no way you could get a factor of 70 compression on the data you describe. Why do you think that?
Even if the input bytes were constrained to four values, the best you could get is a factor of four compression. (8 bits over 2 bits.) You have a normal distribution with 10 or 11 values just within ±1 sigma.
Maybe you could get a factor of two compression for random bytes with your statistics. On a good day.
Update:
Just calculated the entropy of your distribution, assuming a standard deviation of 5. I get an entropy per sample of 4.37 bits. So my estimate of a factor of two was overly optimistic. More like a factor of 1.8.
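If you want to reproduce that number, here is one way to approximate it (a sketch that assigns each integer value the probability mass of a unit-wide bin under a normal with sigma = 5):
import math
import numpy as np
sigma = 5.0
def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
values = np.arange(-40, 41)
p = np.array([Phi((v + 0.5) / sigma) - Phi((v - 0.5) / sigma) for v in values])
p = p[p > 0]  # drop empty bins before taking the log
print(-(p * np.log2(p)).sum())  # about 4.37 bits per sample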
By the way, you don't need to do this by hand. You can use zlib with the Z_HUFFMAN_ONLY strategy. It will generate the optimal Huffman codes for you.
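A minimal sketch of that route (the exact size depends on the data, but for bytes with this distribution it should come out around 5-6 KB, in line with the entropy estimate above):
import zlib
import numpy as np
np.random.seed(1234)
test_array = np.round(np.random.normal(0, 5, size=(100, 100))).astype(np.int8)
# Huffman-only deflate: no LZ77 matching, just Huffman codes built from the byte frequencies
co = zlib.compressobj(level=9, strategy=zlib.Z_HUFFMAN_ONLY)
compressed = co.compress(test_array.tobytes()) + co.flush()
print(len(compressed))  # very roughly 5-6 KB for this distribution
# Round trip back to the original array
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8).reshape(test_array.shape)
assert np.array_equal(restored, test_array)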
Upvotes: 2
Reputation: 231385
I haven't worked with this kind of encoding, but I question whether your 144 bytes is an accurate measure.
Your bytes_array is 10,000 elements of 12 bytes each ('S12'), or 12 times the size of the test_array (1 byte per element).
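That is easy to check directly, and 120,000 bytes lines up with the roughly 118 kB files from the question:
bytes_array.dtype, bytes_array.nbytes  # (dtype('S12'), 120000)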
If instead we make a list:
In [440]: alist = [encoding_dict[key] for key in test_array.flatten()]
In [441]: len(alist)
Out[441]: 10000
In [442]: alist[:10]
Out[442]:
[b'1101',
b'10010',
b'01001',
b'1011',
b'0101',
b'0110',
b'0110',
b'1000',
b'1111',
b'0111101']
and looking at the total length of those strings:
In [444]: sum([len(i) for i in alist])
Out[444]: 43938
That's an average of about 4.4 bytes per element. Even if we can somehow convert those byte strings into bits (one bit per character), that's only about a 50% compression:
In [445]: _/8
Out[445]: 5492.25
Upvotes: 1