Reputation: 9390
I have many numpy arrays of about 50K elements. I want to compare them, using only certain positions of them (10% of them in the average), and performance matters. This looks like a good use case for index arrays. I can write this code:
def equal_1(array1, array2, index):
    return (array1[index] == array2[index]).all()
That is fast in practice, but it always iterates over all the indexes, even when a difference occurs at the very first one.
I can use this other approach too:
def equal_2(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True
This one only iterates until the first difference is found.
I benchmarked both approaches for my use case.
For arrays that are equal, or whose differences are near the end, the index-array version is about 30 times faster. When the differences are near the beginning of the array, the explicit loop is about 30 times faster.
Is there a way to get the best of both worlds (numpy speed + second function laziness)?
Upvotes: 0
Views: 85
Reputation: 36691
For your purposes, you may want to use the just-in-time compiler @jit from numba.
import numpy as np
from numba import jit
a1 = np.arange(50000)
a2 = np.arange(50000)
# change some values so the comparison evaluates to False
a2[40000:45000] = 1
indices = np.random.choice(np.arange(50000), replace=False, size=5000)
indices.sort()
def equal_1(array1, array2, index):
    return (array1[index] == array2[index]).all()
def equal_2(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True
@jit  # just add this decorator to your function
def equal_3(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True
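Before timing, it is worth a quick sanity check that the pure-NumPy and explicit-loop versions agree. A minimal check along the lines of the setup above (the fixed seed is my addition, for reproducibility):

```python
import numpy as np

def equal_1(array1, array2, index):
    return (array1[index] == array2[index]).all()

def equal_2(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True

rng = np.random.default_rng(0)  # fixed seed, not in the original snippet
a1 = np.arange(50000)
a2 = a1.copy()
a2[40000:45000] = 1  # differences near the end of the array
indices = np.sort(rng.choice(50000, size=5000, replace=False))

# the two implementations must agree on any input
assert bool(equal_1(a1, a2, indices)) == equal_2(a1, a2, indices)
assert equal_2(a1, a1, indices)  # identical arrays always compare equal
```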
Testing:
In [44]: %%timeit -n10 -r1
...: equal_1(a1,a2,indices)
...:
10 loops, best of 1: 72.6 µs per loop
In [45]: %%timeit -n10 -r1
...: equal_2(a1,a2,indices)
...:
10 loops, best of 1: 657 µs per loop
In [46]: %%timeit -n10 -r1
...: equal_3(a1,a2,indices)
...:
10 loops, best of 1: 7.65 µs per loop
Just by adding @jit you can get a ~100x speed-up in your Python operation.
Upvotes: 1