Reputation: 1866
I am creating a system that stores large numpy arrays in pyarrow.plasma. I want to give each array a unique, deterministic plasma.ObjectID; sadly, np.array is not hashable. My current (broken) approach is:
import numpy as np
from pyarrow import plasma

def int_to_bytes(x: int) -> bytes:
    # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3
    return x.to_bytes((x.bit_length() + 7) // 8, "big")

def get_object_id(arr):
    arr_id = int(arr.sum() / (arr.shape[0]))
    oid: bytes = int_to_bytes(arr_id).zfill(20)  # fill from left with zeroes; must be of length 20
    return plasma.ObjectID(oid)
But this can easily fail, for example:
arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3, 2, 2)
assert get_object_id(a1) != get_object_id(a2), 'Hash collision'
# another good test case
assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4,3))
assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))
It also involves summing the array, which could be very slow for large arrays.
Feel free to assume that the dtype of arr will be np.uint or np.int.
I now think that it's impossible to never have a hash collision (I only have 20 bytes of ID, so there are more possible inputs than the 2^160 possible IDs), so I am just looking for something that is either a) cheaper to compute, b) less likely to fail in practice, or, ideally, both!
Upvotes: 3
Views: 2054
Reputation: 43817
The hashlib module has routines for computing hashes from byte strings (cryptographic digests, rather than CRC-style checksums). You can convert an ndarray into a byte string with ndarray.tobytes; however, your examples will still fail because those arrays have the same bytes but different shapes. So you could just hash the shape as well.
import hashlib

def hasharr(arr):
    # Hash the raw bytes, then mix in each dimension of the shape
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    return h.digest()
Example:
>>> hasharr(a1)
b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
>>> hasharr(a2)
b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"
I'm not an expert on blake2b so you'd have to do your own research to figure out how likely a collision would be.
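For what it's worth, this approach passes the test cases from your question, since the shape is mixed into the digest. A self-contained check (assuming numpy is installed):

```python
import hashlib
import numpy as np

def hasharr(arr):
    # Hash the raw bytes, then mix in each dimension of the shape
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    return h.digest()

arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3, 2, 2)

assert hasharr(a1) != hasharr(a2)                                  # same bytes, different shape
assert hasharr(np.ones(12)) != hasharr(np.ones(12).reshape(4, 3))  # same bytes, different shape
assert hasharr(np.ones(12)) != hasharr(np.zeros(12))               # different bytes
assert len(hasharr(a1)) == 20                                      # right length for a plasma.ObjectID
```

The digest is 20 bytes, so you can feed it straight to plasma.ObjectID without any zfill padding.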
I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert these buffers (there will be multiple, and some may be None) to byte strings with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here because pyarrow arrays are always one-dimensional.
Upvotes: 4