rth

Reputation: 11201

Efficiently handling duplicates in a Python list

I'm looking to compactly represent duplicates in a Python list / 1D numpy array. For instance, say we have

 x = np.array([1, 0, 0, 3, 3, 0])

this array has several duplicate elements, that can be represented with a

 group_id = np.array([0, 1, 1, 2, 2, 1])

so that all duplicates in a given group can be found with x[group_id==<some_id>].

The list of duplicate pairs can be efficiently computed with sorting,

 s_idx = np.argsort(x)
 diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]

where the pair s_idx[diff_idx] <-> s_idx[diff_idx+1] correspond to the indices in the original array that are duplicates. (here array([1, 2, 3]) <-> array([2, 5, 4])).
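For concreteness, this pair computation can be run end-to-end on the example array (using kind='stable' so the order among duplicates matches the indices quoted above):

```python
import numpy as np

x = np.array([1, 0, 0, 3, 3, 0])
s_idx = np.argsort(x, kind='stable')  # stable sort keeps equal values in original order
# positions in the sorted array where consecutive elements are equal
diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]
left, right = s_idx[diff_idx], s_idx[diff_idx + 1]
print(left, right)  # [1 2 3] [2 5 4]
```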

However, I'm not sure how to efficiently calculate group_id from this linkage information for large array sizes (N > 10⁶).

Edit: as suggested by @Chris_Rands, this can indeed be done with itertools.groupby,

 import numpy as np
 import itertools

 def get_group_id(x):
     group_id = np.zeros(x.shape, dtype=int)
     seen = {}  # value -> group id, in order of first appearance
     for key, _ in itertools.groupby(x):
         if key not in seen:
             seen[key] = len(seen)
         group_id[x == key] = seen[key]  # O(n) mask per group
     return group_id

however the scaling appears to be O(n²), which would not work for my use case (N > 10⁶),

  for N in [50000, 100000, 200000]:
      %time _ = get_group_id(np.random.randint(0, N, size=N))

  CPU times: total: 1.53 s
  CPU times: total: 5.83 s
  CPU times: total: 23.9 s

and I believe using the duplicate-linkage information would be more efficient, since computing the duplicate pairs for N = 200000 takes just 6.44 µs in comparison.
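(For what it's worth, the same argsort can be turned into group ids in O(n log n) with a cumulative sum over the value-boundary mask. Note this is only a sketch: the ids come out in sorted-value order rather than first-appearance order, so they differ from the group_id in the example above.)

```python
import numpy as np

def get_group_id_sorted(x):
    """Group ids via one argsort; ids follow sorted-value order,
    not first-appearance order."""
    s_idx = np.argsort(x, kind='stable')
    xs = x[s_idx]
    # True wherever a new value starts in the sorted array
    starts = np.concatenate(([True], xs[1:] != xs[:-1]))
    gid_sorted = np.cumsum(starts) - 1
    group_id = np.empty(x.shape, dtype=np.intp)
    group_id[s_idx] = gid_sorted  # scatter back to original order
    return group_id

x = np.array([1, 0, 0, 3, 3, 0])
print(get_group_id_sorted(x))  # [1 0 0 2 2 0]
```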

Upvotes: 0

Views: 162

Answers (2)

Divakar

Reputation: 221564

Here's an approach using np.unique that assigns IDs in order of each value's first appearance -

unq, first_idx, ID = np.unique(x, return_index=True, return_inverse=True)
out = first_idx.argsort().argsort()[ID]

Sample run -

In [173]: x
Out[173]: array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])

In [174]: unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)

In [175]: first_idx.argsort().argsort()[ID]
Out[175]: array([0, 1, 1, 2, 2, 1, 3, 1, 4, 5, 1, 1, 6, 7])
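Wrapped into a function (the name first_appearance_group_id is just for illustration) and applied to the array from the question, this reproduces the requested group_id:

```python
import numpy as np

def first_appearance_group_id(x):
    # np.unique sorts the values; first_idx holds each value's first-occurrence
    # index, ID maps every element to its sorted-value rank
    unq, first_idx, ID = np.unique(x, return_index=True, return_inverse=True)
    # double argsort turns first-occurrence positions into ranks,
    # i.e. ids ordered by first appearance
    return first_idx.argsort().argsort()[ID]

x = np.array([1, 0, 0, 3, 3, 0])
print(first_appearance_group_id(x))  # [0 1 1 2 2 1]
```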

Upvotes: 1

Warren Weckesser

Reputation: 114821

You could use numpy.unique:

In [13]: x = np.array([1, 0, 0, 3, 3, 0])

In [14]: values, cluster_id = np.unique(x, return_inverse=True)

In [15]: values
Out[15]: array([0, 1, 3])

In [16]: cluster_id
Out[16]: array([1, 0, 0, 2, 2, 0])

(The cluster IDs are assigned in the order of the sorted unique values, not in the order of a value's first appearance in the input.)

Locations of the items in cluster 0:

In [22]: cid = 0

In [23]: values[cid]
Out[23]: 0

In [24]: (cluster_id == cid).nonzero()[0]
Out[24]: array([1, 2, 5])
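This plugs straight into the x[group_id==<some_id>] pattern from the question; for example, selecting all duplicates in cluster 2 (the value 3):

```python
import numpy as np

x = np.array([1, 0, 0, 3, 3, 0])
values, cluster_id = np.unique(x, return_inverse=True)
print(x[cluster_id == 2])  # [3 3]
```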

Upvotes: 1
