Reputation: 68
I want to convert an integer partition to murmur3 hash with below code.
from cassandra.metadata import Murmur3Token
a = 3202012
h = Murmur3Token.hash_fn(a)
but i got following error
object of type 'int' has no len()
there is no problem with string.
Upvotes: 1
Views: 161
Reputation: 16747
If you don't want to read my post, in short simplest solution is to just use hex(a)
for integer a
i.e. h = Murmur3Token.hash_fn(hex(a))
.
Your error is due to that Murmur3Token.hash_fn(...)
is not implemented for integer (int
) Python type.
Usually hash functions are implemented for hashing just single block of bytes of any size, and not for any other non-bytes types. Other types are usually converted to bytes, this is called Serialization.
You function supports both bytes
and str
types. You have to convert somehow you number to one of these two types. I've implemented next three solutions (variants 0, 1, 2) of conversion:
0
is converting to hex string, just do hex(n)
. This is most simple to use solution, but can be slower than other solutions for large integers, but faster for small integers. This is a very stable solution, meaning that hex representation and hence hash value will never change.1
is converting to bytes through pickle.dumps(n)
. This is a bit more complex than previous solution, needing module pickle
. It is also faster for large integers but slower for small. Also might be somewhat unstable because pickling format may change some time resulting in different produced bytes and hash.2
is converting to bytes through implemented by me function to_bytes(n)
. It is even 1.5x
faster than pickle
solution for small ints and around same speed for large inputs. This solution is also stable in a sence that unlike pickle
format my format will not change (unless you modify my function).In general if you want just simplicity or if your ints are smaller than 1024 bits then just use hex(n)
. If your integers are very large (more than 1024 bits) use my function to_bytes(n)
, or you also may still use hex(n)
it is just 2.5x
slower, although maybe this is quite large slowdown if you have a lot of data, then prefer to_bytes(n)
.
NOTE: All 3 different functions/ways of serializing described in my post will produce different hashes. You need to choose just one of them and use it always for your numbers to produce same results each time.
My code needs installing next python modules one time through python -m pip install cassandra-driver timerit matplotlib
command line, modules timerit
and matplotlib
are used only for time/speed measurement and not needed later for actual hash computation.
Below are located 1) code 2) time measurement plots 3) console output.
# Needs: python -m pip install cassandra-driver timerit matplotlib
from cassandra.metadata import Murmur3Token
# Input
n = 3202012
# ---------- Variant 0, convert to string ----------
print(Murmur3Token.hash_fn(hex(n)))
# ---------- Variant 1, convert to bytes using pickle ----------
import pickle
print(Murmur3Token.hash_fn(pickle.dumps(n)))
# ---------- Variant 2, convert signed int to min num of bytes ----------
def to_bytes(n, order = 'little'): # order can be 'little' or 'big'
# Zig-Zag encode signed integer to unsigned integer, i.e. map
# 0 to 0, -1 to 1, 1 to 2, -2 to 3, 2 to 4, -3 to 5, 3 to 6, etc
n = (n << 1) if n >= 0 else (((-n - 1) << 1) | 1)
return n.to_bytes((n.bit_length() + 7) // 8, order)
print(Murmur3Token.hash_fn(to_bytes(n)))
# ---------- Time/Speed measure and Visualizing ----------
import random
from timerit import Timerit
random.seed(0)
Timerit._default_asciimode = True
ncycle = 16
def round_fixed(n, c):
s = str(round(n, c))
return s + '0' * (c - (len(s) - 1 - s.rfind('.')))
stats = []
for bit_len_log in range(0, 21, 1):
stats.append([])
bit_len = 1 << bit_len_log
n = random.randrange(1 << bit_len)
num_runs = round(max(1, 2 ** 11 / bit_len)) * 3
print('bit length =', bit_len if bit_len < 1024 else f'{round(bit_len / 1024)}Ki')
rt = None
for fi, (f, fn) in enumerate([
#(str, 'str'),
(hex, 'hex'),
(lambda x: pickle.dumps(x), 'pickle'),
(to_bytes, 'to_bytes'),
]):
print(f'var{fi} ({str(fn).ljust(len("to_bytes"))}): ', end = '', flush = True)
tim = Timerit(num = num_runs, verbose = 0)
for t in tim:
for i in range(ncycle):
Murmur3Token.hash_fn(f(n))
ct = tim.mean() / ncycle
print(f'{round_fixed(ct * 10 ** 6, 2)} mcs', end = '')
if rt is None:
rt = ct
print()
else:
print(f', speedup {round(rt / ct, 2)}x')
stats[-1].append({
'bll': bit_len_log, 'fi': fi, 'fn': fn, 't': ct * 10 ** 6, 'su': rt / ct,
})
import math, matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (9.6, 5.4)
for yt in ['t']:
plt.xlabel('bit len')
plt.yscale('log')
plt.ylabel('time, mcs')
for i in range(len(stats[0])):
p, = plt.plot([e[i]['bll'] for e in stats], [e[i][yt] for e in stats])
p.set_label(stats[0][i]['fn'])
plt.xticks([stats[i][0]['bll'] for i in range(0, len(stats), 2)], [f"2^{stats[i][0]['bll']}" for i in range(0, len(stats), 2)])
p10f, p10l = [r(3 * math.log(e) / math.log(10)) for e, r in zip(plt.ylim(), (math.floor, math.ceil))]
pows = [i / 3 for i in range(p10f, p10l + 1)]
plt.yticks([10. ** p for p in pows], [round(10. ** p, 1) for p in pows])
plt.legend()
plt.show()
plt.clf()
Time measurement plots, x
and y
have logarithmic scale, x
signifies input integer bit length as a power of 2, y
signifies time in mcs
spent computing Murmur3Token.hash_fn(f(n))
where n
is the input random integer of x
bit length and f
is measured function one of hex
/pickle.dumps
/to_bytes
:
Console output, presents log output of time measured for each bit length and function, same data as on plot above expressed as text:
-8130291462150033527
1742844748022614520
-7849657241401383516
bit length = 1
var0 (hex ): 1.21 mcs
var1 (pickle ): 3.06 mcs, speedup 0.39x
var2 (to_bytes): 2.15 mcs, speedup 0.56x
bit length = 2
var0 (hex ): 1.25 mcs
var1 (pickle ): 3.04 mcs, speedup 0.41x
var2 (to_bytes): 2.14 mcs, speedup 0.58x
bit length = 4
var0 (hex ): 1.23 mcs
var1 (pickle ): 3.03 mcs, speedup 0.41x
var2 (to_bytes): 2.15 mcs, speedup 0.57x
bit length = 8
var0 (hex ): 1.24 mcs
var1 (pickle ): 3.08 mcs, speedup 0.4x
var2 (to_bytes): 2.19 mcs, speedup 0.56x
bit length = 16
var0 (hex ): 1.26 mcs
var1 (pickle ): 2.98 mcs, speedup 0.42x
var2 (to_bytes): 2.16 mcs, speedup 0.58x
bit length = 32
var0 (hex ): 1.31 mcs
var1 (pickle ): 3.05 mcs, speedup 0.43x
var2 (to_bytes): 2.18 mcs, speedup 0.6x
bit length = 64
var0 (hex ): 1.32 mcs
var1 (pickle ): 3.43 mcs, speedup 0.38x
var2 (to_bytes): 2.18 mcs, speedup 0.61x
bit length = 128
var0 (hex ): 1.40 mcs
var1 (pickle ): 3.44 mcs, speedup 0.41x
var2 (to_bytes): 2.22 mcs, speedup 0.63x
bit length = 256
var0 (hex ): 1.59 mcs
var1 (pickle ): 3.47 mcs, speedup 0.46x
var2 (to_bytes): 2.29 mcs, speedup 0.69x
bit length = 512
var0 (hex ): 1.97 mcs
var1 (pickle ): 3.70 mcs, speedup 0.53x
var2 (to_bytes): 2.47 mcs, speedup 0.8x
bit length = 1Ki
var0 (hex ): 2.69 mcs
var1 (pickle ): 4.02 mcs, speedup 0.67x
var2 (to_bytes): 2.84 mcs, speedup 0.95x
bit length = 2Ki
var0 (hex ): 4.43 mcs
var1 (pickle ): 5.35 mcs, speedup 0.83x
var2 (to_bytes): 3.45 mcs, speedup 1.28x
bit length = 4Ki
var0 (hex ): 7.27 mcs
var1 (pickle ): 5.96 mcs, speedup 1.22x
var2 (to_bytes): 5.16 mcs, speedup 1.41x
bit length = 8Ki
var0 (hex ): 13.66 mcs
var1 (pickle ): 8.31 mcs, speedup 1.64x
var2 (to_bytes): 8.37 mcs, speedup 1.63x
bit length = 16Ki
var0 (hex ): 25.39 mcs
var1 (pickle ): 14.27 mcs, speedup 1.78x
var2 (to_bytes): 13.78 mcs, speedup 1.84x
bit length = 32Ki
var0 (hex ): 48.91 mcs
var1 (pickle ): 24.59 mcs, speedup 1.99x
var2 (to_bytes): 23.80 mcs, speedup 2.06x
bit length = 64Ki
var0 (hex ): 95.75 mcs
var1 (pickle ): 43.23 mcs, speedup 2.21x
var2 (to_bytes): 44.24 mcs, speedup 2.16x
bit length = 128Ki
var0 (hex ): 189.91 mcs
var1 (pickle ): 81.09 mcs, speedup 2.34x
var2 (to_bytes): 84.14 mcs, speedup 2.26x
bit length = 256Ki
var0 (hex ): 376.56 mcs
var1 (pickle ): 155.73 mcs, speedup 2.42x
var2 (to_bytes): 164.22 mcs, speedup 2.29x
bit length = 512Ki
var0 (hex ): 781.47 mcs
var1 (pickle ): 318.82 mcs, speedup 2.45x
var2 (to_bytes): 324.04 mcs, speedup 2.41x
bit length = 1024Ki
var0 (hex ): 1503.77 mcs
var1 (pickle ): 608.79 mcs, speedup 2.47x
var2 (to_bytes): 648.54 mcs, speedup 2.32x
Upvotes: 1