Reputation: 11201
I'm looking to compactly represent duplicates in a Python list / 1D numpy array. For instance, say we have
x = np.array([1, 0, 0, 3, 3, 0])
this array has several duplicate elements, that can be represented with a
group_id = np.array([0, 1, 1, 2, 2, 1])
so that all duplicates in a given cluster are found with x[group_id==<some_id>]
.
The list of duplicate pairs can be efficiently computed with sorting,
s_idx = np.argsort(x)
diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]
where the pair s_idx[diff_idx]
<-> s_idx[diff_idx+1]
correspond to the indices in the original array that are duplicates.
(here array([1, 2, 3])
<-> array([2, 5, 4])
).
However, I'm not sure how to efficiently calculate cluster_id
from this linkage information for large arrays sizes (N > 10⁶
).
Edit: as suggested by @Chris_Rands, this can indeed be done with itertools.groupby
,
import numpy as np
import itertools
def get_group_id(x):
group_id = np.zeros(x.shape, dtype='int')
for i, j in itertools.groupby(x):
j_el = next(j)
group_id[x==j_el] = i
return group_id
however the scaling appears to be O(n^2), and this would not scale to my use case (N > 10⁶
),
for N in [50000, 100000, 200000]:
%time _ = get_group_id(np.random.randint(0, N, size=N))
CPU times: total: 1.53 s
CPU times: total: 5.83 s
CPU times: total: 23.9 s
and I belive using the duplicate linkage information would be more efficient as computing duplicate pairs for N=200000
takes just 6.44 µs in comparison.
Upvotes: 0
Views: 162
Reputation: 221564
Here's an approach using np.unique
to keep the order according to the first appearance of a number -
unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
out = first_idx.argsort().argsort()[ID]
Sample run -
In [173]: x
Out[173]: array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])
In [174]: unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
In [175]: first_idx.argsort().argsort()[ID]
Out[175]: array([0, 1, 1, 2, 2, 1, 3, 1, 4, 5, 1, 1, 6, 7])
Upvotes: 1
Reputation: 114821
You could use numpy.unique
:
In [13]: x = np.array([1, 0, 0, 3, 3, 0])
In [14]: values, cluster_id = np.unique(x, return_inverse=True)
In [15]: values
Out[15]: array([0, 1, 3])
In [16]: cluster_id
Out[16]: array([1, 0, 0, 2, 2, 0])
(The cluster IDs are assigned in the order of the sorted unique values, not in the order of a value's first appearance in the input.)
Locations of the items in cluster 0:
In [22]: cid = 0
In [23]: values[cid]
Out[23]: 0
In [24]: (cluster_id == cid).nonzero()[0]
Out[24]: array([1, 2, 5])
Upvotes: 1