Reputation: 1349
I can filter NumPy arrays in either of two ways:
a[np.where(a[:,0]==some_expression)]
or
a[a[:,0]==some_expression]
What are the (dis)advantages of each of these versions, especially with regard to performance?
Upvotes: 3
Views: 904
Reputation: 164623
Boolean indexing is transformed into integer indexing internally. This is indicated in the docs:
In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above.
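A minimal sketch of that equivalence (my own illustration, not taken from the docs):
import numpy as np

a = np.arange(12).reshape(4, 3)
mask = a[:, 0] > 3                    # Boolean mask over the rows

# Boolean indexing and indexing with mask.nonzero() select the same rows
print(np.array_equal(a[mask], a[mask.nonzero()]))  # True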
So the complexity of the two approaches is the same. But np.where is more efficient for large arrays:
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 10, (10**7, 1))
%timeit a[np.where(a[:, 0] == 5)]  # 50.1 ms per loop
%timeit a[a[:, 0] == 5]            # 62.6 ms per loop
Now np.where has other benefits: advanced integer indexing works well across multiple dimensions. For an example where Boolean indexing is unintuitive in this respect, see NumPy indexing: broadcasting with Boolean arrays. Since np.where is also more efficient than Boolean indexing, that's just an extra reason to prefer it.
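A small sketch of that multi-dimensional point (my own example, not from the linked question): np.where on a 2-D mask yields separate row and column index arrays that can be reused or broadcast, whereas the Boolean mask itself always gives a flattened 1-D selection.
import numpy as np

m = np.arange(12).reshape(3, 4)
mask = m % 5 == 0

rows, cols = np.where(mask)          # explicit index arrays: rows = [0 1 2], cols = [0 1 2]
print(m[mask])                       # Boolean indexing: flat 1-D result [ 0  5 10]
print(m[rows, cols])                 # same values via integer indexing
print(m[rows[:, None], cols])        # broadcasting the index arrays selects a 3x3 block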
Upvotes: 3
Reputation: 34016
To my surprise, the first version (the one using np.where) seems to perform slightly better:
import numpy as np
import timeit

# np.random.random_integers is deprecated; randint(1, 101) draws the same 1..100 range
a = np.random.randint(1, 101, size=(1000, 1))

repeat = 3
numbers = 1000
setup = """from __main__ import np, a"""

def time(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))

time('a[np.where(a[:,0]==99)]')
time('a[(a[:,0]==99)]')
prints (for instance):
0.017856399000000023
0.019185326999999974
Increasing the size of the array makes the numbers differ even more.
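For reference, a sketch of how the same comparison could be re-run on a larger array (this reuses the time() helper and setup string above; the array size and loop count are my own choices):
numbers = 100                                   # fewer timing loops, since each run is slower
a = np.random.randint(1, 101, size=(10**6, 1))
time('a[np.where(a[:,0]==99)]')
time('a[(a[:,0]==99)]')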
Upvotes: 1