CBowman

Reputation: 237

Vectorizing the conversion of columns in a 2D numpy array to byte strings

Background

I have a 2D numpy array which represents a large number of grid-coordinate vectors, and each of these coordinate vectors needs to be converted to a byte string so that they can be collected into a Python set.

This byte-string conversion process is a real bottleneck in my code's run-time, so I'm looking for ways to speed it up.

Example code

from numpy import int16
from numpy.random import randint
# make an array of coordinate vectors full of random ints
A = randint(-100,100,size = (10000,5), dtype=int16)
# pull each vector out of the array using iteration and convert to byte string
A = [v.tobytes() for v in A]
# build a set using the byte strings
S = set(A)

Timing tests

Using timeit to test the current code, we get:

from timeit import timeit
setup = 'from numpy import int16; from numpy.random import randint; A = randint(-100,100,size = (10000,5), dtype=int16)'
code = 'S = set([v.tobytes() for v in A])'
t = timeit(code, setup = setup, number=500)
print(t)
>>> 1.136594653999964

Building the set after the conversion is less than 15% of the total computation time:

setup = 'from numpy import int16; from numpy.random import randint; A = randint(-100,100,size = (10000,5), dtype=int16); A = [v.tobytes() for v in A]'
code = 'S = set(A)'
t = timeit(code, setup = setup, number=500)
print(t)
>>> 0.15499859599980482

It's also worth noting that doubling the integer width to 32 bits has only a small effect on the run time:

setup = 'from numpy import int32; from numpy.random import randint; A = randint(-100,100,size = (10000,5), dtype=int32)'
code = 'S = set([v.tobytes() for v in A])'
t = timeit(code, setup = setup, number=500)
print(t)
>>> 1.1422132620000411

This leads me to suspect that most of the time here is being eaten up by overhead, either from the iteration or from the function call to tobytes().

If that's the case, is there a vectorized way of doing this that avoids the iteration?

Thanks!

Upvotes: 1

Views: 239

Answers (1)

Divakar

Reputation: 221624

Here's a vectorized method using np.frombuffer -

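import numpy as np

# a : Input array of coordinates with int16 dtype (2 bytes per element),
#     so each row is reinterpreted as one string of a.shape[1]*2 bytes
S = set(np.frombuffer(a,dtype='S'+str(a.shape[1]*2)))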
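The one-liner hard-codes 2 bytes per element, which matches int16. As a minimal sketch, the same idea can be wrapped in a helper that reads the row width from `a.itemsize`, so it also covers the int32 case from the question. The helper name `rows_as_bytes_set` is hypothetical, and the `np.ascontiguousarray` call is an added assumption to make sure the rows sit next to each other in memory before the buffer is reinterpreted.

```python
import numpy as np

def rows_as_bytes_set(a):
    """Hypothetical helper: one fixed-width byte-string entry per row of a 2D array."""
    a = np.ascontiguousarray(a)            # rows must be laid out back to back
    row_nbytes = a.shape[1] * a.itemsize   # e.g. 5 columns * 2 bytes = 10 for int16
    # Reinterpret the whole buffer as an array of row_nbytes-wide byte strings
    return set(np.frombuffer(a, dtype='S' + str(row_nbytes)))

a = np.random.randint(-100, 100, size=(10000, 5), dtype=np.int16)
S = rows_as_bytes_set(a)
print(len(S))  # number of distinct coordinate vectors
```

Because the width comes from the array itself, switching the dtype to int32 should need no other change.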

Timings on given sample dataset -

In [83]: np.random.seed(0)
    ...: a = randint(-100,100,size = (10000,5), dtype=int16)

In [128]: %timeit set([v.tobytes() for v in a])
2.71 ms ± 99.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [129]: %timeit set(np.frombuffer(a,dtype='S'+str(a.shape[1]*2)))
933 µs ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [130]: out1 = set([v.tobytes() for v in a])

In [131]: out2 = set(np.frombuffer(a,dtype='S'+str(a.shape[1]*2)))

In [132]: (np.sort(list(out1))==np.sort(list(out2))).all()
Out[132]: True
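One caveat about the check above: `np.sort(list(...))` turns both sets back into fixed-width arrays before comparing. NumPy's `S` dtype drops trailing zero bytes when it hands back individual elements, so an entry produced by `np.frombuffer` can be shorter than the matching `v.tobytes()` string even though the number of distinct rows is the same. The sketch below illustrates this on a tiny hand-made array. The last part, serializing once and slicing the result, is an alternative I am assuming here rather than something from the answer; it still loops in Python, so it is worth timing before relying on it.

```python
import numpy as np

# Explicit little-endian int16 so the byte layout is the same on every platform
a = np.array([[1, 2, 3, 4, 5],
              [1, 2, 3, 4, -5]], dtype='<i2')
row_nbytes = a.shape[1] * a.itemsize  # 10 bytes per row

via_loop = {v.tobytes() for v in a}
via_frombuffer = set(np.frombuffer(a, dtype='S' + str(row_nbytes)))

# Same count of distinct rows either way
print(len(via_loop) == len(via_frombuffer))
# The first row ends in a zero byte, which the 'S' dtype strips from the element
print(sorted(len(b) for b in via_frombuffer))
# So the two sets are not equal element by element
print(via_loop == via_frombuffer)

# Assumed alternative: call tobytes() once, then slice out each row
buf = a.tobytes()
via_slices = {buf[i:i + row_nbytes] for i in range(0, len(buf), row_nbytes)}
print(via_slices == via_loop)
```

If the set is only used for counting or deduplication, the stripped entries are harmless, since stripping cannot merge two different fixed-width rows. It matters only when entries are later compared against strings produced by `tobytes()` or decoded back into coordinates.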

Upvotes: 1
