Reputation: 5011
I am pretty new to numpy, I also am using pypy 2.2 which has limited numpy support (see http://buildbot.pypy.org/numpy-status/latest.html) but what I'm trying to do is filter an array by a set of values (i.e keep subarray if it contains a value in a set). I can do with a list comprehension but I'd rather do without the intermediate list as on longer arrays it isn't fast and I can't help but think numpy filtering will be faster.
>> a = np.array([[ 368, 322, 175238, 2],
[ 430, 382, 121486, 2],
[ 451, 412, 153521, 2],
[ 480, 442, 121468, 2],
[ 517, 475, 109543, 2],
[ 543, 503, 121471, 2],
[ 576, 537, 100566, 2],
[ 607, 567, 121473, 2],
[ 640, 597, 153561, 2]])
>> b = {121486, 153521, 121473}
>> np.array([x for x in a if x[2] in b])
>> array([[ 430, 382, 121486, 2],
[ 451, 412, 153521, 2],
[ 607, 567, 121473, 2]])
Upvotes: 8
Views: 7837
Reputation: 26886
For relatively small inputs like those in your question, the fastest method is by far and large the naïve one:
np.array([x for x in a if x[2] in b])
This would be true especially for PyPy.
For larger inputs, @askewchan solution with NumPy may be faster:
a[np.in1d(a[:,2], list(b))]
However, when using CPython, a Numba-based implementation would be even faster (at all scale):
import numpy as np
import numba as nb
@nb.jit
def custom_filter(arr, values):
values = set(values)
n, m = arr.shape
result = np.empty((n, m), dtype=arr.dtype)
k = 0
for i in range(n):
if arr[i, 2] in values:
result[k, :] = arr[i, :]
k += 1
return result[:k, :].copy()
@nb.jit
def custom_filter2(arr, values):
values = set(values)
n, m = arr.shape
k = 0
for i in range(n):
if arr[i, 2] in values:
k += 1
result = np.empty((k, m), dtype=arr.dtype)
k = 0
for i in range(n):
if arr[i, 2] in values:
result[k, :] = arr[i, :]
k += 1
return result
A quick glimpse into benchmarks:
aa = np.tile(a, (1000, 1))
bb = set(list(range(121000, 122000)))
%timeit np.array([x for x in a if x[2] in b])
# 8.54 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit custom_filter(a, tuple(b))
# 1.59 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit custom_filter2(a, tuple(b))
# 1.45 µs ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a[np.in1d(a[:,2], tuple(b))]
# 25.2 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([x for x in aa if x[2] in b])
# 6.76 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit custom_filter(aa, tuple(b))
# 90.6 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit custom_filter2(aa, tuple(b))
# 135 µs ± 5.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit aa[np.in1d(aa[:, 2], tuple(b))]
# 147 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([x for x in aa if x[2] in bb])
# 7.26 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit custom_filter(aa, tuple(bb))
# 226 µs ± 5.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit custom_filter2(aa, tuple(bb))
# 278 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit aa[np.in1d(aa[:, 2], tuple(bb))]
# 756 µs ± 62.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 0
Reputation: 46530
You can do it in one line, but you have to use list(b)
, so it might not actually be any faster:
>>> a[np.in1d(a[:,2], list(b))]
array([[ 430, 382, 121486, 2],
[ 451, 412, 153521, 2],
[ 607, 567, 121473, 2]])
It works because np.in1d
tells you which of the first item are in the second:
>>> np.in1d(a[:,2], list(b))
array([False, True, True, False, False, False, False, True, False], dtype=bool)
For large a
and b
, this is probably faster than your solution, as it still uses b
as a set, but builds only boolean array instead of rebuilding the entire array one line at a time. For large a
and small b
, I think np.in1d
might be faster.
ainb = np.array([x in b for x in a[:,2]])
a[ainb]
For small a
and large b
, your own solution is probably fastest.
Upvotes: 7