Igor Rivin

Reputation: 4864

Making a list of clusters from `sklearn` cluster label output

sklearn's clustering estimators output a list of labels, where the ith element is the label of the cluster that the ith sample belongs to. Now, suppose I want a list of clusters instead, where each cluster is the list of indices of its members. There is a fairly obvious way of doing that:

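import numpy as np

def clustarrays(labs):
    # One (initially empty) list per label value 0..max(labs)
    howmany = np.max(labs) + 1
    results = [[] for _ in range(howmany)]
    # Append each sample index to the list for its cluster label
    for i, cnum in enumerate(labs):
        results[cnum].append(i)
    return results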


ll = [1, 2, 3, 0, 0, 5, 5, 5]

clustarrays(ll)
[[3, 4], [0], [1], [2], [], [5, 6, 7]]

This works, but the Python-level loop will be very slow for large datasets. Is there a more numpy-centric way of doing this?

Upvotes: 1

Views: 250

Answers (1)

Ehsan

Reputation: 12397

If you want pure numpy, use:

def clustarrays(labs):
    # Split the stably sorted indices at the cumulative label counts
    counts = np.unique(labs, return_counts=True)[1]
    return np.split(np.argsort(labs, kind="stable"), counts.cumsum()[:-1])

output:

[array([3, 4]), array([0]), array([1]), array([2]), array([5, 6, 7])]
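Note that `np.unique` only counts labels that actually occur, so an unused label (4 in the example) gets no entry and list positions need not line up with label values. If you need position i to hold cluster i, as in the question's loop version, a minimal sketch using `np.bincount` could look like the following. It assumes the labels are non-negative integers, so noise labels such as the -1 produced by DBSCAN would have to be filtered or shifted first. The name `clusters_by_label` is just illustrative.

```python
import numpy as np

def clusters_by_label(labs):
    """Return a list whose i-th entry is the array of sample indices with label i.

    Assumes labs contains only non-negative integers.
    """
    labs = np.asarray(labs)
    # Stable sort keeps the indices inside each cluster in ascending order
    order = np.argsort(labs, kind="stable")
    # bincount includes a zero for every unused label below the maximum
    counts = np.bincount(labs)
    # Split points are the cumulative counts, without the final total
    boundaries = np.cumsum(counts)[:-1]
    return np.split(order, boundaries)

ll = [1, 2, 3, 0, 0, 5, 5, 5]
print([c.tolist() for c in clusters_by_label(ll)])
# [[3, 4], [0], [1], [2], [], [5, 6, 7]]
```

Because the zero count for label 4 produces a repeated split point, `np.split` emits an empty array in that position, which matches the output of the original loop.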

I would suggest pandas:

import pandas as pd

def clustarrays(labs):
    df = pd.DataFrame({'labs': labs})
    return df.groupby(df.labs).groups

output:

{0: Int64Index([3, 4], dtype='int64'), 1: Int64Index([0], dtype='int64'), 2: Int64Index([1], dtype='int64'), 3: Int64Index([2], dtype='int64'), 5: Int64Index([5, 6, 7], dtype='int64')}
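The exact repr of the index objects depends on the pandas version. If plain Python lists are easier to work with, one possible variant (a sketch, not the only way) uses the `indices` attribute of the groupby object, which maps each label to a NumPy array of positions. The function name `clusters_as_dict` is made up for this example.

```python
import pandas as pd

def clusters_as_dict(labs):
    """Map each label that occurs in labs to the list of positions carrying it."""
    s = pd.Series(labs)
    # GroupBy.indices is a dict {label: array of integer positions}
    groups = s.groupby(s).indices
    # Convert keys and values to built-in types for easier downstream use
    return {int(label): idx.tolist() for label, idx in groups.items()}

ll = [1, 2, 3, 0, 0, 5, 5, 5]
print(clusters_as_dict(ll))
# {0: [3, 4], 1: [0], 2: [1], 3: [2], 5: [5, 6, 7]}
```

As with `groups`, labels that never occur (4 here) simply have no key, which is often more convenient than a positional list when labels are sparse or negative.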

Upvotes: 1
