Reputation: 620
I have a numpy array of type int8 and shape (100,100). I have used Huffman coding to find a maximally efficient way to encode its contents. For my particular use case, the values I am trying to save are roughly normally distributed around 0 with a standard deviation of about 5, but this question applies to optimally saving ndarrays with any distribution. In my case, extreme values of up to -20 or 20 are observable in rare cases. Clearly it is more space-efficient to Huffman code this type of array than to use standard 8-bit integers. My question is how to do this.
I tried using np.save() and pickle but was not able to get the efficiency I want. Specifically, for an array of shape (100,100) I get a file of size 10,128 bytes using np.save(), which makes sense with 1 byte per integer plus overhead. Using pickle, I get a file of size 10,158 bytes, about the same. However, based on my Huffman coding scheme, I should be able to encode the array contents of my particular test case (shown below) in 144 bytes (not including overhead)!
I tried mapping each integer in the array to its optimal byte string, so that I have an array of byte strings (type S12), and saving that, but I get a 118 kB file using both np.save() and pickle, so that obviously does not work.
Thanks for the help!
Code to reproduce my exact test case:
import pickle
import numpy as np
# Seed random number generator
np.random.seed(1234)
# Build random normal array and round it
test_array = np.random.normal(0, 5, size=(100, 100))
test_array = np.round(test_array).astype(np.int8)
# Set encoding dictionary
encoding_dict = {6: b'0000',
                 -8: b'00010',
                 8: b'00011',
                 5: b'0010',
                 -5: b'0011',
                 12: b'0100000',
                 -13: b'01000010',
                 14: b'010000110',
                 -15: b'0100001110',
                 -14: b'0100001111',
                 10: b'010001',
                 7: b'01001',
                 -4: b'0101',
                 4: b'0110',
                 -7: b'01110',
                 11: b'0111100',
                 -11: b'0111101',
                 -12: b'01111100',
                 13: b'011111010',
                 -19: b'011111011000',
                 -18: b'011111011001',
                 -16: b'01111101101',
                 -17: b'011111011100',
                 16: b'011111011101',
                 15: b'01111101111',
                 -10: b'0111111',
                 -3: b'1000',
                 -6: b'10010',
                 -9: b'100110',
                 9: b'100111',
                 3: b'1010',
                 -2: b'1011',
                 1: b'1100',
                 2: b'1101',
                 -1: b'1110',
                 0: b'1111'}
# Save using different methods
np.save('test.npy', test_array)
with open('test.pkl', 'wb') as file:
    pickle.dump(test_array, file)
# Try converting to bytes and then saving
bytes_array = np.array([encoding_dict[key] for key in test_array.flatten()]).reshape(test_array.shape)
np.save('test_bytes.npy', bytes_array)
with open('test_bytes.pkl', 'wb') as file:
    pickle.dump(bytes_array, file)
# See how many bytes it should take
tmp_flat = test_array.flatten()
tmp_bytes = np.zeros_like(tmp_flat)
for i in range(len(tmp_bytes)):
    tmp_bytes[i] = len(encoding_dict[tmp_flat[i]]) / 8
print(tmp_bytes.sum())
Upvotes: 1
Views: 1535
Reputation: 114310
Your error is here:
tmp_bytes = np.zeros_like(tmp_flat)
tmp_flat is an int8 array, so the statement tmp_bytes[i] = len(encoding_dict[tmp_flat[i]]) / 8 is truncating a lot of digits when it converts a float value to int. Replace the offending line with something like:
tmp_bytes = np.zeros(tmp_flat.shape, np.single)
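With that change the per-symbol byte counts are no longer truncated to integers, and re-running your counting loop (same names as in your snippet) should land around 5492 bytes rather than 144:
for i in range(len(tmp_bytes)):
    tmp_bytes[i] = len(encoding_dict[tmp_flat[i]]) / 8
print(tmp_bytes.sum())  # roughly 5492.25 here, i.e. about 4.4 bits per value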
But to show how to actually do the compression: I would recommend using np.packbits, which will actually create an array of 5493 bytes for you.
# Make a string of all the data
s = b''.join(encoding_dict[key] for key in test_array.ravel())
# Convert the string into an array
a = np.array(list(map(int, s.decode('ascii'))))
# pack it
result = np.packbits(a)
The statement a = ... is doing a whole lot of extra work, since it decodes the data, then copies it, then converts a string to an integer umpteen times, etc. Here is a longer, but much more efficient method:
s = bytearray(b'').join(encoding_dict[key] for key in test_array.ravel())
a = np.array(s)
a -= ord('0') # a now contains just 0 and 1
result = np.packbits(a)
When you stash this array, make sure that you include the number of bits you expected, rather than the number of bytes. You can unpack to a binary string using np.unpackbits, which supports a count parameter specifically for this purpose (my addition by the way).
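For instance, a rough sketch of the decode path, assuming the packed result was stored together with the bit count (nbits = len(s) from the snippet above; decoding_dict and restored are just illustrative names):
nbits = len(s)
bits = np.unpackbits(result, count=nbits)  # back to an array of 0s and 1s
bitstring = (bits + ord('0')).astype(np.uint8).tobytes().decode('ascii')
decoding_dict = {v.decode('ascii'): k for k, v in encoding_dict.items()}
decoded, code = [], ''
for ch in bitstring:  # greedy left-to-right walk works because the codes are prefix-free
    code += ch
    if code in decoding_dict:
        decoded.append(decoding_dict[code])
        code = ''
restored = np.array(decoded, dtype=np.int8).reshape(test_array.shape)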
Final point, use ravel instead of flatten when you can. The latter always makes a copy, while the former generally does not.
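A quick way to see the difference (the array here is just for illustration):
x = np.zeros((100, 100), dtype=np.int8)
print(np.shares_memory(x, x.ravel()))    # True: a view, no copy for a contiguous array
print(np.shares_memory(x, x.flatten()))  # False: flatten always copies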
Upvotes: 1
Reputation: 112374
There is no way you could get a factor of 70 compression on the data you describe. Why do you think that?
Even if the input bytes were constrained to four values, the best you could get is a factor of four compression. (8 bits over 2 bits.) You have a normal distribution with 10 or 11 values just within ±1 sigma.
Maybe you could get a factor of two compression for random bytes with your statistics. On a good day.
Update:
Just calculated the entropy of your distribution, assuming a standard deviation of 5. I get an entropy per sample of 4.37 bits. So my estimate of a factor of two was overly optimistic. More like a factor of 1.8.
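If you want to reproduce that number, here is one way to approximate it (a sketch that assigns each integer value the probability mass of a unit-wide bin under a normal with sigma = 5):
import math
import numpy as np
sigma = 5.0
def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
values = np.arange(-40, 41)
p = np.array([Phi((v + 0.5) / sigma) - Phi((v - 0.5) / sigma) for v in values])
p = p[p > 0]  # drop empty bins before taking the log
print(-(p * np.log2(p)).sum())  # about 4.37 bits per sample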
By the way, you don't need to do this by hand. You can use zlib with the Z_HUFFMAN_ONLY strategy. It will generate the optimal Huffman codes for you.
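A minimal sketch of that route (the exact size depends on the data, but for bytes with this distribution it should come out around 5-6 KB, in line with the entropy estimate above):
import zlib
import numpy as np
np.random.seed(1234)
test_array = np.round(np.random.normal(0, 5, size=(100, 100))).astype(np.int8)
# Huffman-only deflate: no LZ77 matching, just Huffman codes built from the byte frequencies
co = zlib.compressobj(level=9, strategy=zlib.Z_HUFFMAN_ONLY)
compressed = co.compress(test_array.tobytes()) + co.flush()
print(len(compressed))  # very roughly 5-6 KB for this distribution
# Round trip back to the original array
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8).reshape(test_array.shape)
assert np.array_equal(restored, test_array)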
Upvotes: 2
Reputation: 231385
I haven't worked with this kind of encoding, but I question whether your 144 bytes is an accurate measure.
Your bytes_array is 10,000 elements of 12 bytes each ('S12'), or 12 times the size of the test_array (1 byte per element).
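That is easy to check directly, and 120,000 bytes lines up with the roughly 118 kB files from the question:
bytes_array.dtype, bytes_array.nbytes  # (dtype('S12'), 120000)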
If instead we make a list:
In [440]: alist = [encoding_dict[key] for key in test_array.flatten()]
In [441]: len(alist)
Out[441]: 10000
In [442]: alist[:10]
Out[442]:
[b'1101',
b'10010',
b'01001',
b'1011',
b'0101',
b'0110',
b'0110',
b'1000',
b'1111',
b'0111101']
and looking at the total length of those strings:
In [444]: sum([len(i) for i in alist])
Out[444]: 43938
That's an average of about 4.4 bytes per element. Even if we can somehow convert those byte strings into bits (one bit per character), that's only about a 50% compression:
In [445]: _/8
Out[445]: 5492.25
Upvotes: 1