Data Clustering using Pandas

Question

I have a dataset

Name    Subset    Type    System
A00     IU00-A    OP      A
A00     IT00      PP      A
B01     IT-01A    PP      B
B01     IU        OP      B
B03     IM-09-B   LP      A
B03     IM03A     OP      B
B03     IT-09     OP      A
D09     IT        OP      B
D09     IM        LP      B
D09     IM        OP      A

So here I need to group the Name column such that Subset and Type are similar. We have to only consider the first alphabetical part of the subset column and ignore rest. for eg IM-09-B, IM03A can be considered as IM.

Output needed

Subset Cluster    Type Cluster    Name        System
IU,IT             OP,PP           A00,B01     A,A,B,B
IM,IM,IT          LP, OP, OP      B03, D09    A,B,A,B,B,A

Here the first cluster instance is formed coz IU is OP and IT is PP in both cases, similar for the second instance.

user7864386 · Accepted Answer

You could use groupby + agg(join) twice; once to make groups out of "Names"; then again to make groups out of "Subset"-"Type" pairs. Since we're joining strings, we first use sort_values so that when we join later, each group will have the same order of elements.

out = (df.assign(Subset=df['Subset'].str[:2])
       .sort_values(by=df.columns.tolist())
       .groupby('Name').agg(', '.join)
       .add_suffix(' Cluster').reset_index()
       .set_index(['Subset Cluster', 'Type Cluster'])
       .groupby(level=[0,1]).agg(', '.join).reset_index())

Output:

  Subset Cluster Type Cluster      Name
0     IM, IM, IT   LP, OP, OP  B03, D09
1         IT, IU       OP, PP  A00, B01

Data Clustering using Pandas

Answers (1)

Related Questions