Reputation: 3832
I am looking to apply a function to each row of a numpy array. If this function evaluates to True I will keep the row, otherwise I will discard it. For example, my function might be:
def f(row):
    if sum(row)>10: return True
    else: return False
I was wondering if there was something similar to:
np.apply_over_axes()
which applies a function to each row of a numpy array and returns the result. I was hoping for something like:
np.filter_over_axes()
which would apply a function to each row of a numpy array and only return rows for which the function returned True. Is there anything like this? Or should I just use a for loop?
Upvotes: 41
Views: 63483
Reputation: 23509
As @Roger Fan mentioned, applying a function row-wise should really be done in a vectorized fashion on the entire array. The canonical way to filter is to construct a boolean mask and apply it to the array. That said, if the function is so complex that vectorization is not possible, it is faster to convert the array into a Python list first (especially if the function uses Python built-ins such as sum()) and apply the function to the rows of that list.
msk = arr.sum(axis=1)>10                  # best way to create a boolean mask
msk = [f(row) for row in arr.tolist()]    # second best way; note the conversion to a list via .tolist()
filtered_arr = arr[msk]                   # filtered via boolean indexing
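To make the two mask constructions concrete, here is a minimal sketch on a small hand-written array (the values are made up for illustration and are not from the question). Both masks should select the same rows.

```python
import numpy as np

def f(row):
    # same predicate as in the question: keep rows whose sum exceeds 10
    return sum(row) > 10

# illustrative data only
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [0, 1, 2]])

mask_vec = arr.sum(axis=1) > 10                # vectorized boolean mask
mask_loop = [f(row) for row in arr.tolist()]   # per-row fallback on a plain list

filtered_vec = arr[mask_vec]     # rows [4, 5, 6] and [7, 8, 9]
filtered_loop = arr[mask_loop]   # same rows, selected via the list of bools
```

A list of Python bools works as an index here because NumPy converts it to a boolean array before indexing.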
As you can see from the timeit test below, looping over a list (arr.tolist()) is about twice as fast as looping over a numpy array (arr), partly because Python's sum() and not np.sum() is called in the function f(). That said, the vectorized method is roughly ten times faster than even the list version.
import numpy as np
def f(row):
    if sum(row)>10: return True
    else: return False
arr = np.random.rand(10000, 200)
%timeit arr[[f(row) for row in arr]]
# 260 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr[[f(row) for row in arr.tolist()]]
# 114 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr[arr.sum(axis=1)>10]
# 10.8 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
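The %timeit lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison could be reproduced in a plain script with the standard-library timeit module. The array size, the seed, and the names `candidates` and `timings` are my own choices for illustration, and the absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def f(row):
    return sum(row) > 10

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
arr = rng.random((1000, 50))     # smaller than the array above so it runs quickly

# the three approaches compared in this answer
candidates = {
    "loop over array": lambda: arr[[f(row) for row in arr]],
    "loop over list": lambda: arr[[f(row) for row in arr.tolist()]],
    "vectorized": lambda: arr[arr.sum(axis=1) > 10],
}

# best of 3 repeats, 3 calls per repeat, reported as seconds per call
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name:>16}: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over the repeats is the usual advice for timeit, since the higher values mostly reflect interference from other processes.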
Upvotes: 0
Reputation: 5045
Ideally, you would be able to implement a vectorized version of your function and use that to do boolean indexing. For the vast majority of problems this is the right solution. NumPy provides quite a few functions that can act over various axes, as well as all the basic operations and comparisons, so most useful conditions should be vectorizable.
import numpy as np
x = np.random.randn(20, 3)
x_new = x[np.sum(x, axis=1) > .5]
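As an illustration of how far this goes, here is a small sketch of compound row conditions built only from axis-wise reductions and element-wise comparisons. The data and the particular conditions are invented for the example.

```python
import numpy as np

# illustrative data only
x = np.array([[ 1.0, -2.0, 3.0],
              [ 0.5,  0.5, 0.5],
              [-1.0, -1.0, 4.0],
              [ 2.0,  2.0, 2.0]])

# keep rows whose sum exceeds 1 AND that contain no negative entry
mask_and = (x.sum(axis=1) > 1) & (x >= 0).all(axis=1)

# keep rows whose largest entry is at least 3 OR whose mean is above 1.5
mask_or = (x.max(axis=1) >= 3) | (x.mean(axis=1) > 1.5)

x_and = x[mask_and]   # rows 1 and 3
x_or = x[mask_or]     # rows 0, 2 and 3
```

Note the parentheses around each comparison: `&` and `|` bind more tightly than `>` in Python, so leaving them out changes the meaning or raises an error.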
If you are absolutely sure that you can't do the above, I would suggest using a list comprehension (or np.apply_along_axis) to create an array of bools to index with.
def myfunc(row):
    return sum(row) > .5
bool_arr = np.array([myfunc(row) for row in x])
x_new = x[bool_arr]
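For completeness, a sketch of the np.apply_along_axis variant mentioned above, on a small made-up array. It still calls the Python function once per row, so I would not expect it to be faster than the list comprehension.

```python
import numpy as np

def myfunc(row):
    return sum(row) > .5

# illustrative data only
x = np.array([[ 0.1, 0.2, 0.1],
              [ 0.3, 0.4, 0.5],
              [-1.0, 0.2, 0.9],
              [ 1.0, 0.0, 0.0]])

# axis=1 hands each row to myfunc; the result is a 1-D boolean array
bool_arr = np.apply_along_axis(myfunc, 1, x)
x_new = x[bool_arr]   # keeps the second and fourth rows
```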
This will get the job done in a relatively clean way, but will be significantly slower than a vectorized version. An example:
x = np.random.randn(5000, 200)
%timeit x[np.sum(x, axis=1) > .5]
# 100 loops, best of 3: 5.71 ms per loop
%timeit x[np.array([myfunc(row) for row in x])]
# 1 loops, best of 3: 217 ms per loop
Upvotes: 41