Canol Gökel

Reputation: 1275

Getting the coordinates of elements in clusters without a loop in numpy

I have a 2D array in which I label clusters using the scipy.ndimage.label() function, like this:

import numpy as np
from scipy.ndimage import label

input_array = np.array([[0, 1, 1, 0],
                        [1, 1, 0, 0],
                        [0, 0, 0, 1],
                        [0, 0, 0, 1]])

labeled_array, _ = label(input_array)

# Result:
# labeled_array == [[0, 1, 1, 0],
#                   [1, 1, 0, 0],
#                   [0, 0, 0, 2],
#                   [0, 0, 0, 2]]

I can get the element counts, the centroids, or the bounding boxes of the labeled clusters, but I would also like to get the coordinates of every element in each cluster. Something like this (any data structure is okay, it doesn't have to be exactly this one):

{
    1: [(0, 1), (0, 2), (1, 0), (1, 1)],  # Coordinates of the elements that have the label "1"
    2: [(2, 3), (3, 3)]  # Coordinates of the elements that have the label "2"
}

I can loop over the labels and call np.where() for each one, but is there a way to do this without a loop, so that it is faster?

Upvotes: 2

Views: 577

Answers (2)

Mad Physicist

Reputation: 114230

You can build an array of the coordinates, sort it by label, and split it where the label changes:

# Get the indexes (coordinates) of the labeled (non-zero) elements
ind = np.argwhere(labeled_array)

# Get the labels corresponding to those indexes above
labels = labeled_array[tuple(ind.T)]

# Sort both arrays so that lower label numbers appear before higher ones. This is not cosmetic:
# the "diff" step below relies on equal labels being adjacent.
sort = labels.argsort()
ind = ind[sort]
labels = labels[sort]

# Find the split points where a new label number starts in the ordered label numbers
splits = np.flatnonzero(np.diff(labels)) + 1

# Build a dict from the label numbers and the indexes (coordinates).
# The first argument to zip is the label at position 0 plus the label at each split point.
# The second argument is the indexes (coordinates), split at the same split points,
# so both arguments to zip have the same length.
result = {k: v for k, v in zip(labels[np.r_[0, splits]],
                               np.split(ind, splits))}
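For reference, here is a self-contained sketch of the same sort-and-split idea, wrapped in a function. The function name `cluster_coordinates`, the empty-input guard, and the use of `kind="stable"` are my own additions, not part of the snippet above. The stable sort is there because the default sort kind does not guarantee that coordinates within one label stay in row-major order. `labeled_array` is typed in literally from the question so the example runs with numpy alone.

```python
import numpy as np

# labeled_array as shown in the question (output of scipy.ndimage.label)
labeled_array = np.array([[0, 1, 1, 0],
                          [1, 1, 0, 0],
                          [0, 0, 0, 2],
                          [0, 0, 0, 2]])


def cluster_coordinates(labeled):
    """Map each non-zero label to an (n, ndim) array of its coordinates.

    Hypothetical helper name; the logic follows the sort-and-split steps.
    """
    ind = np.argwhere(labeled)            # coordinates of non-zero cells, row-major
    labels = labeled[tuple(ind.T)]        # label of each of those cells
    if labels.size == 0:                  # assumption: no clusters means empty dict
        return {}
    order = labels.argsort(kind="stable")  # keep row-major order inside each label
    ind = ind[order]
    labels = labels[order]
    splits = np.flatnonzero(np.diff(labels)) + 1   # positions where the label changes
    keys = labels[np.r_[0, splits]]                # first label of every group
    return {int(k): v for k, v in zip(keys, np.split(ind, splits))}


result = cluster_coordinates(labeled_array)
for key, coords in result.items():
    print(key, [tuple(c) for c in coords.tolist()])
# 1 [(0, 1), (0, 2), (1, 0), (1, 1)]
# 2 [(2, 3), (3, 3)]
```

The values are numpy arrays rather than lists of tuples; convert them as in the print loop if you need the exact structure from the question.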

Upvotes: 5

Scott Boston

Reputation: 153460

Method 1:

You can try this, which still loops, but through a dictionary comprehension:

{k: list(zip(*np.where(labeled_array == k))) for k in range(1, labeled_array.max() + 1)}

Output:

{1: [(0, 1), (0, 2), (1, 0), (1, 1)], 2: [(2, 3), (3, 3)]}

Method 2 (slow):

Here's a way using pandas, which is probably slower than Mad Physicist's method:

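import pandas as pd

# Stack into (row, column, label) records, group the coordinates by label,
# and drop the first group (the background label 0) with [1:]
(pd.DataFrame(labeled_array)
  .stack()
  .reset_index()
  .groupby(0).agg(list)[1:]
  .apply(lambda x: list(zip(*x)), axis=1)
).to_dict()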

Output:

{1: [(0, 1), (0, 2), (1, 0), (1, 1)], 2: [(2, 3), (3, 3)]}

Timings with this data:

Dictionary comprehension:

8.73 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Coordinate map, sort and split:

57.3 µs ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

pandas:

5.16 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
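Keep in mind that these numbers come from a 4x4 array with two labels, where fixed per-call overhead dominates. The loop scans the whole array once per label, so the comparison may look different with larger arrays and many labels. Below is a rough sketch for checking this on your own data. The function names `by_loop` and `by_sort_split` are made up for illustration, and the test array is random integers used as stand-in labels (an assumption, not real connected components from `label()`), so treat the result only as an indication.

```python
import timeit

import numpy as np


def by_loop(labeled):
    # One full-array comparison per label
    return {k: np.argwhere(labeled == k) for k in range(1, labeled.max() + 1)}


def by_sort_split(labeled):
    # Sort-and-split approach; stable sort keeps row-major order per label
    ind = np.argwhere(labeled)
    labels = labeled[tuple(ind.T)]
    order = labels.argsort(kind="stable")
    ind, labels = ind[order], labels[order]
    splits = np.flatnonzero(np.diff(labels)) + 1
    return dict(zip(labels[np.r_[0, splits]].tolist(), np.split(ind, splits)))


# Synthetic stand-in: a 300x300 array with labels 0..200 (0 = background)
rng = np.random.default_rng(0)
labeled = rng.integers(0, 201, size=(300, 300))

for fn in (by_loop, by_sort_split):
    seconds = timeit.timeit(lambda: fn(labeled), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Swap in your own labeled array to see which approach wins for your array size and label count.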

Upvotes: 1
