Reputation: 354
I have a dataset
Name System
A AZ
A NaN
B AZ
B NaN
B NaN
C AY
C AY
D AZ
E AY
E AY
E NaN
F AZ
F AZ
F NaN
Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".
In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster.
Output Example:
Cluster Names
AZ A,B
AY,AY C,E
AZ,AZ F
PS. Actual dataset may vary in number of rows and columns How can I do it using ML based clustering algorithms like KNN, Naive Bayes, etc? I need two approaches, one without ignoring NaN, one ignoring NaN.
Upvotes: 1
Views: 196
Reputation: 262484
IIUC, this looks like a double groupby
. You first need to group by Name and System and aggregate System to form the clusters. Then this is a simple groupby
with aggregation as string.
(df.groupby(['Name', 'System'])
['System'].agg(Cluster=','.join) # clusters of repeats
.droplevel('System').reset_index()
.groupby('Cluster')['Name'].agg(','.join) # aggregate by cluster
.reset_index()
)
output:
Cluster Name
0 AY,AY C,E
1 AZ A,B,D
2 AZ,AZ F
NB. I used aggregation with ','.join
here but you could also use tuple
or a custom function (first groupby
) and frozenset
(second groupby
) to keep access to the individual elements
(df.groupby(['Name', 'System'], dropna=False)
['System'].agg(Cluster=lambda x: (x.iloc[0], len(x)))
.droplevel('System').reset_index()
.groupby('Cluster')['Name'].agg(frozenset)
.reset_index()
)
output:
Cluster Name
0 (AY, 2) (E, C)
1 (AZ, 1) (B, A, D)
2 (AZ, 2) (F)
3 (nan, 1) (E, F, A)
4 (nan, 2) (B)
Upvotes: 1