Reputation: 9390
I have many numpy arrays of about 50K elements. I want to compare them, using only certain positions of them (10% of them in the average), and performance matters. This looks like a good use case for index arrays. I can write this code:
def equal_1(array1, array2, index):
    return (array1[index] == array2[index]).all()
That is fast in practice, but it always iterates over all the indexes, even when a difference occurs at the very first one.
I can use this other approach too:
def equal_2(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True
This one only iterates until the first difference is found.
I benchmarked both approaches for my use case.
For arrays that are equal, or whose differences are near the end, the index-array version is about 30 times faster. When the differences are near the beginning of the array, the explicit loop is about 30 times faster.
Is there a way to get the best of both worlds (numpy speed + second function laziness)?
Upvotes: 0
Views: 85
Reputation: 36691
For your purposes, you may want to use the just-in-time compiler @jit from numba.
import numpy as np
from numba import jit
a1 = np.arange(50000)
a2 = np.arange(50000)
# change some values so the comparison evaluates to False
a2[40000:45000] = 1
indices = np.random.choice(np.arange(50000), replace=False, size=5000)
indices.sort()
def equal_1(array1, array2, index):
    return (array1[index] == array2[index]).all()
def equal_2(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True
@jit  # just add this decorator to your function
def equal_3(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True
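Before timing, it is worth a quick sanity check that the pure-NumPy and explicit-loop versions agree. A minimal check along the lines of the setup above (the fixed seed is my addition, for reproducibility):

```python
import numpy as np

def equal_1(array1, array2, index):
    return (array1[index] == array2[index]).all()

def equal_2(array1, array2, index):
    for i in index:
        if array1[i] != array2[i]:
            return False
    return True

rng = np.random.default_rng(0)  # fixed seed, not in the original snippet
a1 = np.arange(50000)
a2 = a1.copy()
a2[40000:45000] = 1  # differences near the end of the array
indices = np.sort(rng.choice(50000, size=5000, replace=False))

# the two implementations must agree on any input
assert bool(equal_1(a1, a2, indices)) == equal_2(a1, a2, indices)
assert equal_2(a1, a1, indices)  # identical arrays always compare equal
```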
Testing:
In [44]: %%timeit -n10 -r1
...: equal_1(a1,a2,indices)
...:
10 loops, best of 1: 72.6 µs per loop
In [45]: %%timeit -n10 -r1
...: equal_2(a1,a2,indices)
...:
10 loops, best of 1: 657 µs per loop
In [46]: %%timeit -n10 -r1
...: equal_3(a1,a2,indices)
...:
10 loops, best of 1: 7.65 µs per loop
Just by adding @jit you can get a ~100x speed-up in your Python operation.
Upvotes: 1