user7468395

Reputation: 1349

np.where versus Boolean (square-bracket) indexing for filtering NumPy arrays

I can filter NumPy arrays either via

a[np.where(a[:,0]==some_expression)]

or

a[a[:,0]==some_expression]

What are the advantages and disadvantages of each version, especially with regard to performance?

Upvotes: 3

Views: 904

Answers (2)

jpp

Reputation: 164623

Boolean indexing is transformed into integer indexing internally. This is indicated in the docs:

In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above.
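The equivalence described in that passage can be checked on a small array. This is only a sketch: the array shape, the seed, and the variable names (mask, idx, by_mask, and so on) are illustrative assumptions, not part of the original benchmark.

```python
import numpy as np

# Small illustrative array; shape and seed are arbitrary choices.
rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=(20, 3))

mask = a[:, 0] == 5        # Boolean array with one entry per row
idx = np.where(mask)       # tuple holding one integer index array

by_mask = a[mask]                 # Boolean indexing
by_where = a[idx]                 # integer indexing via np.where
by_nonzero = a[mask.nonzero()]    # the form the docs describe

same = (np.array_equal(by_mask, by_where)
        and np.array_equal(by_mask, by_nonzero))
print(same)  # True
```

With a single argument, np.where(mask) returns the same index tuple as mask.nonzero(), so all three expressions select the same rows.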

So the complexity of the two approaches is the same. In the benchmark below, however, np.where is somewhat faster on a large array:

import numpy as np

np.random.seed(0)
a = np.random.randint(0, 10, (10**7, 1))
# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit a[np.where(a[:, 0] == 5)]  # 50.1 ms per loop
%timeit a[a[:, 0] == 5]            # 62.6 ms per loop

Now np.where has other benefits: advanced integer indexing works well across multiple dimensions. For an example where Boolean indexing is unintuitive in this aspect, see NumPy indexing: broadcasting with Boolean arrays. Since np.where is more efficient than Boolean indexing, this is just an extra reason it should be preferred.
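As a rough illustration of the multi-dimensional point, here is a minimal sketch with a hypothetical 3x4 array and hypothetical row and column masks (none of these names come from the linked question). Two Boolean indices are paired element-wise rather than combined into a sub-block, which becomes easier to see once they are written as integer indices.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
rows = np.array([True, False, True])          # rows 0 and 2
cols = np.array([False, True, False, True])   # columns 1 and 3

# Each Boolean index is converted to integer indices, and the two index
# arrays are then paired up: elements (0, 1) and (2, 3).
paired = a[rows, cols]

# With explicit integer indices the pairing is visible, and broadcasting
# can be used to request the full 2x2 sub-block instead.
r = np.where(rows)[0]
c = np.where(cols)[0]
block = a[r[:, None], c]

# np.ix_ builds the same open mesh and accepts Boolean sequences directly.
block_ix = a[np.ix_(rows, cols)]

print(paired)  # [ 1 11]
print(block)   # [[ 1  3]
               #  [ 9 11]]
```

Whether the paired result or the sub-block is wanted depends on the use case; the integer form just makes the choice explicit.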

Upvotes: 3

Mr_and_Mrs_D

Reputation: 34016

To my surprise, the first one seems to perform slightly better:

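import timeit

import numpy as np

# random_integers is deprecated; randint(1, 101) covers the same range (1..100 inclusive)
a = np.random.randint(1, 101, size=(1000, 1))

repeat = 3
numbers = 1000

setup = """from __main__ import np, a"""

def time(statement, _setup=None):
  # Print the best of `repeat` runs, each executing the statement `numbers` times
  print(min(
    timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))

time('a[np.where(a[:,0]==99)]')
time('a[(a[:,0]==99)]')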

prints (for instance):

0.017856399000000023
0.019185326999999974

Increasing the size of the array makes the two timings differ even more.

Upvotes: 1
