Reputation: 237
Background
I have a 2D numpy array which represents a large number of grid-coordinate vectors, and each of these coordinate vectors needs to be converted to a byte string so that it can be stored in a Python set.
This byte-string conversion process is a real bottleneck in my code's run-time, so I'm looking for ways to speed it up.
Example code
from numpy import int16
from numpy.random import randint
# make an array of coordinate vectors full of random ints
A = randint(-100,100,size = (10000,5), dtype=int16)
# pull each vector out of the array using iteration and convert to byte string
A = [v.tobytes() for v in A]
# build a set using the byte strings
S = set(A)
Timing tests
Using timeit to test the current code, we get:
from timeit import timeit
setup = 'from numpy import int16; from numpy.random import randint; A = randint(-100,100,size = (10000,5), dtype=int16)'
code = 'S = set([v.tobytes() for v in A])'
t = timeit(code, setup = setup, number=500)
print(t)
>>> 1.136594653999964
Building the set after the conversion accounts for less than 15% of the total computation time:
setup = 'from numpy import int16; from numpy.random import randint; A = randint(-100,100,size = (10000,5), dtype=int16); A = [v.tobytes() for v in A]'
code = 'S = set(A)'
t = timeit(code, setup = setup, number=500)
print(t)
>>> 0.15499859599980482
It's also worth noting that doubling the size of the integers to 32 bits has only a small effect on the run time:
setup = 'from numpy import int32; from numpy.random import randint; A = randint(-100,100,size = (10000,5), dtype=int32)'
code = 'S = set([v.tobytes() for v in A])'
t = timeit(code, setup = setup, number=500)
print(t)
>>> 1.1422132620000411
This leads me to suspect that most of the time here is being eaten up by the overhead of either the iteration or the function call to tobytes().
If that's the case, is there a vectorized way of doing this that avoids the iteration?
Thanks!
Upvotes: 1
Views: 239
Reputation: 221624
Here's a vectorized method using np.frombuffer:
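import numpy as np
# a : Input array of coordinates with int16 dtype (C-contiguous)
# Each row is reinterpreted as one fixed-width byte string;
# the factor 2 is the size of an int16 in bytes.
S = set(np.frombuffer(a,dtype='S'+str(a.shape[1]*2)))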
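The one-liner above hardcodes 2 bytes per element and reads the array's buffer directly, so it assumes int16 data in C-contiguous layout. Here is a minimal sketch of the same idea that takes the width from the dtype and copies to contiguous memory only when needed. The helper name rows_to_byte_set is hypothetical, not part of NumPy.

```python
import numpy as np

def rows_to_byte_set(a):
    """Hypothetical helper: build a set with one byte string per row of a 2D array."""
    a = np.ascontiguousarray(a)               # no copy if already C-contiguous
    width = a.shape[1] * a.dtype.itemsize     # bytes per row, for any fixed-size dtype
    return set(np.frombuffer(a, dtype='S' + str(width)))

rng = np.random.default_rng(0)
a16 = rng.integers(-100, 100, size=(1000, 5)).astype(np.int16)
a32 = a16.astype(np.int32)

s16 = rows_to_byte_set(a16)
s32 = rows_to_byte_set(a32)
s_strided = rows_to_byte_set(a16[:, ::2])     # non-contiguous view also works
```

Each set should contain one entry per distinct row, which can be cross-checked against np.unique(a, axis=0).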
Timings on the given sample dataset:
In [83]: np.random.seed(0)
...: a = randint(-100,100,size = (10000,5), dtype=int16)
In [128]: %timeit set([v.tobytes() for v in a])
2.71 ms ± 99.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [129]: %timeit set(np.frombuffer(a,dtype='S'+str(a.shape[1]*2)))
933 µs ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: out1 = set([v.tobytes() for v in a])
In [131]: out2 = set(np.frombuffer(a,dtype='S'+str(a.shape[1]*2)))
In [132]: (np.sort(list(out1))==np.sort(list(out2))).all()
Out[132]: True
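One caveat worth knowing: elements of an 'S' dtype array drop trailing null bytes when they are read out as Python objects. The frombuffer set therefore has the same number of entries as the tobytes() set, but the entries themselves can be shorter. The check above compares through NumPy byte-string arrays, which as far as I know treat trailing nulls the same way, so it would not flag this. The sketch below uses a tiny made-up array with an explicit little-endian dtype to show the difference, and one possible way to keep the exact bytes: call tobytes() once on the whole array and slice the result. No timing claim is made for the slicing variant.

```python
import numpy as np

# Tiny illustrative array; '<i2' fixes little-endian int16 so the bytes are predictable.
a = np.array([[1, 0], [1, 2]], dtype='<i2')
width = a.shape[1] * a.dtype.itemsize          # 4 bytes per row

via_tobytes = set(v.tobytes() for v in a)
via_frombuffer = set(np.frombuffer(a, dtype='S' + str(width)))

# Same number of entries, but trailing b'\x00' bytes are gone in the 'S' version.
plain = {bytes(x) for x in via_frombuffer}

# Padding back to the fixed width recovers the original byte strings.
restored = {x.ljust(width, b'\x00') for x in plain}

# Alternative that keeps exact bytes: one tobytes() call, then slice per row.
buf = a.tobytes()
via_slices = {buf[i:i + width] for i in range(0, len(buf), width)}
```

If the set is only used for membership or uniqueness tests among rows of the same width, the stripped form is still unambiguous. If the byte strings are later compared against v.tobytes() output or decoded back into coordinates, pad them or use the slicing variant.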
Upvotes: 1