onepint16oz
onepint16oz

Reputation: 294

Efficiently creating masks - Numpy /Python

Suppose I have a NumPy array with shape (50, 10000, 10000) with 1000 distinct "clusters". For example, there would be small volume somewhere with just 1s, another small volume with 2s, etc. I would like to iterate through each cluster to create a mask like so:

for i in np.unique(arr)[1:]:
    mask = arr == i
    #do other stuff with mask

Creating each mask takes about 15 seconds, and iterating through 1000 clusters would take more than 4 hours. Is there a possible way to speed up the code or is this the best there is since there is no avoiding iterating through each element of the array?

EDIT: the dtype of the array is uint16

Upvotes: 2

Views: 2308

Answers (1)

Daniel F
Daniel F

Reputation: 14399

I'm assuming arr is sparse:

  • you say the clusters are small, and 1000 clusters isn't going to tile an array that big
  • you iterate over np.unique(arr)[1:], so I assume the first unique value is 0

In this case I would recommend leveraging a scipy.sparse.csr_matrix

from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr.reshape(1,-1))

This turns your big dense array into a one-row compressed sparse row array. Since sparse arrays don't like more than 2 dimensions, this tricks it into using ravelled indices. Now sp_arr has data (the cluster labels), indices (the ravelled indices), and indptr (which is trivial here since we only have one row). So,

for i in np.unique(sp_arr.data):  # as a bonus this `unique` call should be faster too
    x, y, z = np.unravel_index(sp_arr.indices[sp_arr.data == i], arr.shape)

Should much more efficiently give equivalent coordinates to

for i in np.unique(arr)[:1]:
    x, y, z = np.nonzero(arr == i)

where x, y, z are the indices of the True values in mask. From there you can either reconstruct mask or work off the indices (recommended).

You could also do this purely with numpy, and still have a boolean mask at the end, but a bit less memory efficient:

all_mask = arr != 0    # points assigned to any cluster
data = arr[all_mask]   # all cluster labels
for i in np.unique(data):
    mask = all_mask.copy()
    mask[mask] = data == i  # now mask is same as before

Upvotes: 1

Related Questions