spd
spd

Reputation: 354

KMeans Clustering using Python

I have a dataset

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".

In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster.

Output Example:

Cluster     Names
AZ          A,B
AY,AY       C,E
AZ,AZ       F 

PS. Actual dataset may vary in number of rows and columns How can I do it using ML based clustering algorithms like KNN, Naive Bayes, etc? I need two approaches, one without ignoring NaN, one ignoring NaN.

Upvotes: 1

Views: 196

Answers (1)

mozway
mozway

Reputation: 262484

IIUC, this looks like a double groupby. You first need to group by Name and System and aggregate System to form the clusters. Then this is a simple groupby with aggregation as string.

(df.groupby(['Name', 'System'])
   ['System'].agg(Cluster=','.join)          # clusters of repeats
   .droplevel('System').reset_index()
   .groupby('Cluster')['Name'].agg(','.join) # aggregate by cluster
   .reset_index()
)

output:

  Cluster   Name
0   AY,AY    C,E
1      AZ  A,B,D
2   AZ,AZ      F

NB. I used aggregation with ','.join here but you could also use tuple or a custom function (first groupby) and frozenset (second groupby) to keep access to the individual elements

(df.groupby(['Name', 'System'], dropna=False)
   ['System'].agg(Cluster=lambda x: (x.iloc[0], len(x)))
   .droplevel('System').reset_index()
   .groupby('Cluster')['Name'].agg(frozenset)
   .reset_index()
)

output:

    Cluster       Name
0   (AY, 2)     (E, C)
1   (AZ, 1)  (B, A, D)
2   (AZ, 2)        (F)
3  (nan, 1)  (E, F, A)
4  (nan, 2)        (B)

Upvotes: 1

Related Questions