Reputation: 354
I have a dataset
Name Subset Type System
A00 IU00-A OP A
A00 IT00 PP A
B01 IT-01A PP B
B01 IU OP B
B03 IM-09-B LP A
B03 IM03A OP B
B03 IT-09 OP A
D09 IT OP B
D09 IM LP B
D09 IM OP A
So here I need to group the Name column such that Subset and Type are similar. We have to only consider the first alphabetical part of the subset column and ignore rest. for eg IM-09-B, IM03A can be considered as IM.
Output needed
Subset Cluster Type Cluster Name System
IU,IT OP,PP A00,B01 A,A,B,B
IM,IM,IT LP, OP, OP B03, D09 A,B,A,B,B,A
Here the first cluster instance is formed coz IU is OP and IT is PP in both cases, similar for the second instance.
Upvotes: 0
Views: 308
Reputation:
You could use groupby
+ agg(join)
twice; once to make groups out of "Names"; then again to make groups out of "Subset"-"Type" pairs. Since we're join
ing strings, we first use sort_values
so that when we join later, each group will have the same order of elements.
out = (df.assign(Subset=df['Subset'].str[:2])
.sort_values(by=df.columns.tolist())
.groupby('Name').agg(', '.join)
.add_suffix(' Cluster').reset_index()
.set_index(['Subset Cluster', 'Type Cluster'])
.groupby(level=[0,1]).agg(', '.join).reset_index())
Output:
Subset Cluster Type Cluster Name
0 IM, IM, IT LP, OP, OP B03, D09
1 IT, IU OP, PP A00, B01
Upvotes: 1