D G

Reputation: 313

Create mask for numpy array based on values' set membership

I want to create a 'mask' index array for an array, based on whether the elements of that array are members of some set. What I want can be achieved as follows:

import numpy as np

x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}
x_mask = np.array([xi in interesting_numbers for xi in x])

I'm wondering if there's a faster way to execute that last line. As it is, it builds a list in Python by repeatedly calling a __contains__ method, then converts that list to a numpy array.

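I want something like x_mask = (x in interesting_numbers), but that doesn't work: it raises a TypeError, because the whole array is hashed for the set lookup and NumPy arrays are unhashable.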

Upvotes: 1

Views: 629

Answers (2)

Divakar

Reputation: 221574

Here's one approach with np.searchsorted -

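import numpy as np

def set_membership(x, interesting_numbers):
    # Sorted array of the set's values, so it can be binary-searched
    b = np.sort(list(interesting_numbers))
    # Leftmost insertion position of each element of x within b
    idx = np.searchsorted(b, x)
    # Positions past the end belong to values > max(b); wrap them to 0, where they cannot match
    idx[idx==b.size] = 0
    return b[idx] == x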
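
To make the steps concrete, here is a small trace of the same idea on the question's own input. It is only an illustrative sketch: the intermediate values in the comments follow from the five-element set in the question, and the code assumes a non-empty set (with an empty set, b[idx] would have nothing to index).

```python
import numpy as np

x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}

b = np.sort(list(interesting_numbers))  # sorted candidates: 1, 5, 7, 17, 18
idx = np.searchsorted(b, x)             # leftmost slot in b where each x would be inserted
idx[idx == b.size] = 0                  # x = 19 lands past the end (slot 5); wrap it to slot 0
x_mask = b[idx] == x                    # True only where x equals the value already in its slot

print(x[x_mask])  # [ 1  5  7 17 18]
```

Wrapping the out-of-range slots to 0 is safe because those values are larger than every element of b, so they can never equal b[0].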

Runtime test -

# Setup inputs with random numbers that are not necessarily sorted
In [353]: x = np.random.choice(100000, 10000, replace=0)

In [354]: interesting_numbers = set(np.random.choice(100000, 1000, replace=0))

In [355]: x_mask = np.array([xi in interesting_numbers for xi in x])

# Verify output with set_membership
In [356]: np.allclose(x_mask, set_membership(x, interesting_numbers))
Out[356]: True

# @Psidom's solution (np.in1d, from the other answer)
In [357]: %timeit np.in1d(x, list(interesting_numbers))
1000 loops, best of 3: 1.04 ms per loop

In [358]: %timeit set_membership(x, interesting_numbers)
1000 loops, best of 3: 682 µs per loop
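
For anyone who wants to rerun this comparison outside IPython, the following is a rough sketch using the standard timeit module. The fixed seed, the repeat counts, and the use of np.isin (the newer spelling of np.in1d) are assumptions made here, not part of the original test; absolute timings will vary by machine and NumPy version, so none are quoted.

```python
import timeit
import numpy as np

def set_membership(x, interesting_numbers):
    b = np.sort(list(interesting_numbers))
    idx = np.searchsorted(b, x)
    idx[idx == b.size] = 0
    return b[idx] == x

rng = np.random.default_rng(0)  # fixed seed (an assumption) so runs are repeatable
x = rng.choice(100000, 10000, replace=False)
interesting_numbers = set(rng.choice(100000, 1000, replace=False).tolist())

candidates = {
    "list comprehension": lambda: np.array([xi in interesting_numbers for xi in x]),
    "np.isin": lambda: np.isin(x, list(interesting_numbers)),
    "set_membership": lambda: set_membership(x, interesting_numbers),
}

reference = candidates["list comprehension"]()
for name, fn in candidates.items():
    # Every candidate must produce the same boolean mask before it is timed
    assert np.array_equal(fn(), reference)
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{name}: {best * 1e6:.0f} µs per call")
```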

Upvotes: 1

akuiper

Reputation: 214967

You can use np.in1d:

np.in1d(x, list(interesting_numbers))
#array([False,  True, False, False, False,  True, False,  True, False,
#       False, False, False, False, False, False, False, False,  True,
#        True, False], dtype=bool)
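
On NumPy 1.13 and later, np.isin does the same job, and the NumPy documentation recommends it over np.in1d for new code (np.in1d is deprecated as of NumPy 2.0). A minimal sketch on the question's input, assuming such a NumPy version is installed:

```python
import numpy as np

x = np.arange(20)
interesting_numbers = {1, 5, 7, 17, 18}

# Convert the set to a list first: passed directly, NumPy would wrap the set
# in a one-element object array instead of an array of its values.
x_mask = np.isin(x, list(interesting_numbers))

print(np.flatnonzero(x_mask))  # [ 1  5  7 17 18]
```

Unlike np.in1d, np.isin keeps the shape of x, so it also works for multi-dimensional arrays.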

Timing shows that it is faster than the list comprehension when the array x is large:

x = np.arange(10000)
interesting_numbers = {1, 5, 7, 17, 18}

%timeit np.in1d(x, list(interesting_numbers))
# 10000 loops, best of 3: 41.1 µs per loop

%timeit x_mask = np.array([xi in interesting_numbers for xi in x])
# 1000 loops, best of 3: 1.44 ms per loop

Upvotes: 3
