Reputation: 8090

Efficient way to get a subset of indices in numpy

I have the following indices as you would get them from np.where(...):

import numpy as np

coords = (
  np.asarray([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
  np.asarray([2, 2, 8, 2, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
  np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
  np.asarray([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)

A second tuple of indices, index, determines which of the entries in coords should be selected:

index = (
  np.asarray([0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6]),
  np.asarray([2, 8, 2, 4, 4, 6, 2, 2, 6, 2, 2, 4, 6, 2, 2, 6, 2, 2, 4, 4, 6, 2, 2, 6]),
  np.asarray([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
  np.asarray([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
)

For instance, the first coordinate in coords, (0, 2, 0, 0), is selected because it also appears in index (at position 0), but the second coordinate, (0, 2, 0, 1), is not selected because it does not appear in index.

I can calculate the mask easily with [x in zip(*index) for x in zip(*coords)] (converted from bool to int for better readability):

[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

but this wouldn't be very efficient for larger arrays. Is there a more "numpy-based" way that could calculate the mask?

Upvotes: 1

Answers (2)

Paul Panzer

Reputation: 53099

You can use np.ravel_multi_index to compress the columns into unique numbers which are easier to handle:

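# Shape large enough to hold the largest coordinate of either tuple
cmx = *map(np.max, coords),
imx = *map(np.max, index),
shape = np.maximum(cmx, imx) + 1

# Flatten each multi-dimensional coordinate into a single unique integer
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)

it.sort()

# Look up each ct value in the sorted it; clip the position so a value
# larger than every element of it does not index past the end
pos = it.searchsorted(ct).clip(max=it.size - 1)
result = ct == it[pos]
print(result.view(np.int8))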

Prints:

[1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
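Once the coordinates are flattened to single integers, the membership test can also be handed to np.isin instead of the manual sort and searchsorted. This is a minimal sketch of that variant, not the approach timed above; the small arrays are a made-up 3-D sample rather than the data from the question, and the names mask and shape are only illustrative.

```python
import numpy as np

# Hypothetical sample data (not the arrays from the question).
coords = (np.array([0, 0, 0, 1, 1]),
          np.array([2, 2, 8, 2, 2]),
          np.array([0, 1, 0, 0, 1]))
index = (np.array([0, 0, 1]),
         np.array([2, 8, 2]),
         np.array([0, 0, 1]))

# A shape large enough to hold every coordinate from both tuples.
shape = tuple(int(max(c.max(), i.max())) + 1 for c, i in zip(coords, index))

# Flatten each coordinate tuple into one integer per entry.
ct = np.ravel_multi_index(coords, shape)
it = np.ravel_multi_index(index, shape)

# True where the flattened coordinate also occurs in the flattened index.
mask = np.isin(ct, it)
print(mask.astype(int))
```

np.isin does not require it to be sorted beforehand, and it has no out-of-range lookup to guard against.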

Upvotes: 1

filippo

Reputation: 5294

I'm not sure about efficiency, but since you're basically comparing pairs of coordinates, you could use SciPy's distance functions. Something along these lines:

from scipy.spatial.distance import cdist

c = np.stack(coords).T
i = np.stack(index).T

d = cdist(c, i)

In [113]: np.any(d == 0, axis=1).astype(int)
Out[113]: 
array([1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1])

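By default it uses the L2 norm. You could also pass a simpler comparison as the metric, though a Python callable is evaluated once per pair of rows, so it is not guaranteed to be faster than the built-in metric, e.g.:

# The callable returns True (stored as 1.0) when two rows are identical
d = cdist(c, i, lambda u, v: np.all(np.equal(u, v)))
np.any(d != 0, axis=1).astype(int)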
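The same all-pairs comparison can be written with plain NumPy broadcasting, which avoids floating-point distances altogether. The following is a rough sketch under the assumption that the full (n, m, d) boolean array fits in memory; the sample arrays are made up, and the names c_rows, i_rows and mask are only illustrative.

```python
import numpy as np

# Hypothetical sample data (not the arrays from the question).
coords = (np.array([0, 0, 0, 1, 1]),
          np.array([2, 2, 8, 2, 2]),
          np.array([0, 1, 0, 0, 1]))
index = (np.array([0, 0, 1]),
         np.array([2, 8, 2]),
         np.array([0, 0, 1]))

c_rows = np.stack(coords).T  # shape (n, d): one coordinate per row
i_rows = np.stack(index).T   # shape (m, d)

# Compare every row of c_rows with every row of i_rows -> (n, m, d) booleans,
# then require equality along all d components.
equal = (c_rows[:, None, :] == i_rows[None, :, :]).all(axis=-1)

# A coordinate is selected if it matches at least one row of i_rows.
mask = equal.any(axis=1)
print(mask.astype(int))
```

Like cdist, this builds an n-by-m intermediate, so memory grows with the product of the two lengths.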

Upvotes: 1
