Reputation: 294
Suppose I have a NumPy array with shape (50, 10000, 10000) with 1000 distinct "clusters". For example, there would be small volume somewhere with just 1s, another small volume with 2s, etc. I would like to iterate through each cluster to create a mask like so:
for i in np.unique(arr)[1:]:
mask = arr == i
#do other stuff with mask
Creating each mask takes about 15 seconds, and iterating through 1000 clusters would take more than 4 hours. Is there a possible way to speed up the code or is this the best there is since there is no avoiding iterating through each element of the array?
EDIT: the dtype of the array is uint16
Upvotes: 2
Views: 2308
Reputation: 14399
I'm assuming arr
is sparse:
np.unique(arr)[1:]
, so I assume the first unique value is 0
In this case I would recommend leveraging a scipy.sparse.csr_matrix
from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr.reshape(1,-1))
This turns your big dense array into a one-row compressed sparse row array. Since sparse arrays don't like more than 2 dimensions, this tricks it into using ravelled indices. Now sp_arr
has data
(the cluster labels), indices
(the ravelled indices), and indptr
(which is trivial here since we only have one row). So,
for i in np.unique(sp_arr.data): # as a bonus this `unique` call should be faster too
x, y, z = np.unravel_index(sp_arr.indices[sp_arr.data == i], arr.shape)
Should much more efficiently give equivalent coordinates to
for i in np.unique(arr)[:1]:
x, y, z = np.nonzero(arr == i)
where x, y, z
are the indices of the True
values in mask
. From there you can either reconstruct mask
or work off the indices (recommended).
You could also do this purely with numpy
, and still have a boolean mask at the end, but a bit less memory efficient:
all_mask = arr != 0 # points assigned to any cluster
data = arr[all_mask] # all cluster labels
for i in np.unique(data):
mask = all_mask.copy()
mask[mask] = data == i # now mask is same as before
Upvotes: 1