Reputation: 2912
I am trying to create a 'mask' of a numpy.array by specifying certain criteria. NumPy even has nice syntax for something like this:
>>> A = numpy.array([1,2,3,4,5])
>>> A > 3
array([False, False, False,  True,  True])
But if I have a list of criteria instead of a range:
>>> A = numpy.array([1,2,3,4,5])
>>> crit = [1,3,5]
I can't do this:
>>> A in crit
I have to do something based on a list comprehension, like this:
>>> numpy.array([a in crit for a in A])
array([ True, False,  True, False,  True])
which gives the correct result.
Now, the problem is that I am working with large arrays and the above code is very slow. Is there a more natural way of doing this operation that might speed it up?
EDIT: I was able to get a small speedup by making crit into a set.
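For reference, the set variant is the same comprehension with O(1) membership tests (a sketch; the name `crit_set` is mine):

```python
import numpy

A = numpy.array([1, 2, 3, 4, 5])
crit_set = set([1, 3, 5])  # set membership is O(1) vs O(len(crit)) for a list

# same list comprehension as above, but each `a in crit_set` test is constant time
mask = numpy.array([a in crit_set for a in A])
```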
EDIT2: For those who are interested:
Jouni's approach: 1000 loops, best of 3: 102 µs per loop
numpy.in1d: 1000 loops, best of 3: 1.33 ms per loop
EDIT3: Just tested again with B = randint(10,size=100)
Jouni's approach: 1000 loops, best of 3: 2.96 ms per loop
numpy.in1d: 1000 loops, best of 3: 1.34 ms per loop
Conclusion: Use numpy.in1d() unless B is very small.
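A minimal sketch of how such a comparison can be reproduced with `timeit` (the array size and loop count here are my assumptions, not the originals; absolute numbers will differ by machine):

```python
import timeit
import numpy

A = numpy.random.randint(10, size=10000)
B = [1, 3, 5]

def loop_mask():
    # OR together one equality mask per element of B (Jouni's approach)
    mask = numpy.zeros(A.shape, dtype=bool)
    for t in B:
        mask = mask | (A == t)
    return mask

def in1d_mask():
    return numpy.in1d(A, B)

# the two approaches agree; which is faster depends on len(B)
print(timeit.timeit(loop_mask, number=100))
print(timeit.timeit(in1d_mask, number=100))
```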
Upvotes: 7
Views: 1289
Reputation: 47072
I think that the numpy function in1d
is what you are looking for:
>>> A = numpy.array([1,2,3,4,5])
>>> B = [1,3,5]
>>> numpy.in1d(A,B)
array([ True, False, True, False, True], dtype=bool)
as stated in its docstring, "in1d(a, b) is roughly equivalent to np.array([item in b for item in a])".
Admittedly, I haven't done any speed tests, but it sounds like what you are looking for.
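On newer NumPy versions, `numpy.isin` is the recommended spelling of the same operation (it preserves the shape of `A`, where `in1d` always flattens):

```python
import numpy

A = numpy.array([1, 2, 3, 4, 5])
B = [1, 3, 5]
mask = numpy.isin(A, B)  # element-wise membership test, same result as in1d here
```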
Another faster way
Here's another way to do it which is faster. First sort B (the array containing the elements you are looking to find in A), turn it into a numpy array, and then do:
B[B.searchsorted(A)] == A
though if you have elements in A that are larger than the largest in B, you will need to do:
inds = B.searchsorted(A)
inds[inds == len(B)] = 0
mask = B[inds] == A
It may not be faster for small arrays (especially for B being small), but before long it will definitely be faster. Why? Because this is an O(N log M) algorithm, where N is the number of elements in A and M is the number of elements in B, while putting together a bunch of individual masks is O(N * M). I tested it with N = 10000 and M = 14 and it was already faster. Anyway, just thought that you might like to know, especially if you are truly planning on using this on very large arrays.
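Putting the steps above together as a self-contained sketch (the value 9 is added to A just to exercise the out-of-range branch):

```python
import numpy

A = numpy.array([1, 2, 3, 4, 5, 9])     # 9 is larger than everything in B
B = numpy.sort(numpy.array([1, 3, 5]))  # B must be a sorted numpy array

inds = B.searchsorted(A)      # O(N log M): one binary search per element of A
inds[inds == len(B)] = 0      # clamp out-of-range indices to a valid slot
mask = B[inds] == A           # a clamped value is > max(B), so it never matches B[0]
```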
Upvotes: 6
Reputation: 17711
Create a mask and use the compress function of the numpy array; it should be much faster. If you have a complex criterion, remember to construct it out of arithmetic on the arrays.
a = numpy.array([3,1,2,4,5])
mask = a > 3
b = a.compress(mask)
or
a = numpy.random.randint(1, 6, 100000)  # random_integers(1,5,...) is deprecated
c = a.compress((a<=4)*(a>=2))   ## numbers n with 2 <= n <= 4
d = a.compress(~((a<=4)*(a>=2)))   ## numbers with n > 4 or n < 2
OK, if you want a mask marking all a in [1,3,5], you can do something like
a = numpy.random.randint(1, 6, 100000)
mask = (a==1) + (a==3) + (a==5)  # adding boolean arrays acts as a logical OR
or
a = numpy.random.randint(1, 6, 100000)
mask = numpy.zeros(len(a), dtype=bool)
for num in [1,3,5]:
    mask += (a==num)
Upvotes: 0
Reputation: 44118
Combine several comparisons with "or":
A = numpy.random.randint(10, size=10000)
mask = (A == 1) | (A == 3) | (A == 5)
Or if you have a list B and want to create the mask dynamically:
B = [1, 3, 5]
mask = zeros((10000,),dtype=bool)
for t in B: mask = mask | (A == t)
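The same loop can also be written without the mutable accumulator using `functools.reduce` (a sketch, not part of the original answer):

```python
import operator
from functools import reduce
import numpy

A = numpy.array([1, 2, 3, 4, 5])
B = [1, 3, 5]

# OR together one equality mask per element of B
mask = reduce(operator.or_, (A == t for t in B))
```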
Upvotes: 3