Reputation: 648
I have a dataframe containing "ID"s of persons in the first column. Each person can have up to 3 other persons that it is assigned into a unique group (or cluster) with. The related "CO_ID"s of a person are stored in 3 other columns. If a person is alone, i.e. there are no other persons assigned to it, it should be considered as it is in an one-person cluster anyway, the values of the other columns beeing nan. Same applies for persons who are only assigned to e.g. one other person: In this case one column contains a "CO_ID" while the other two columns beeing nan.
I wonder how I can assign those (through the CO_ID columns already excatly determined) clusters to each ID via an additional column called "CLUSTER"? Is there a prebuild function for this?
As it is obvious to see from the example data provided, the order of the "CO_ID"s is not important (for ID = ID1, it doesn't matter if CO_ID1 = ID2 and CO_ID2 = ID3 or CO_ID1 = ID3 and CO_ID2 = ID2).
The input data df1 looks like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID' : ['ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10'] ,
'CO_ID1' : ['ID2','ID1','ID2','ID6','ID8','ID4','ID4','ID5', np.nan, 'ID4'],
'CO_ID2' : ['ID3','ID3','ID1', 'ID7', np.nan, 'ID7','ID6', np.nan, np.nan, 'ID6'],
'CO_ID3' : [np.nan, np.nan, np.nan, 'ID10', np.nan, 'ID10', 'ID10', np.nan, np.nan, 'ID7']})
Out[1]:
ID CO_ID1 CO_ID2 CO_ID3
0 ID1 ID2 ID3 NaN
1 ID2 ID1 ID3 NaN
2 ID3 ID2 ID1 NaN
3 ID4 ID6 ID7 ID10
4 ID5 ID8 NaN NaN
5 ID6 ID4 ID7 ID10
6 ID7 ID4 ID6 ID10
7 ID8 ID5 NaN NaN
8 ID9 NaN NaN NaN
9 ID10 ID4 ID6 ID7
And the desired output data df2 looks like this:
df2 = pd.DataFrame({'ID' : ['ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10'] ,
'CO_ID1' : ['ID2','ID1','ID2','ID6','ID8','ID4','ID4','ID5', np.nan, 'ID4'],
'CO_ID2' : ['ID3','ID3','ID1', 'ID7', np.nan, 'ID7','ID6', np.nan, np.nan, 'ID6'],
'CO_ID3' : [np.nan, np.nan, np.nan, 'ID10', np.nan, 'ID10', 'ID10', np.nan, np.nan, 'ID7'],
'Cluster' : ['C1','C1','C1','C2','C3','C2','C2','C3','C4','C2']})
Out[2]:
ID CO_ID1 CO_ID2 CO_ID3 Cluster
0 ID1 ID2 ID3 NaN C1
1 ID2 ID1 ID3 NaN C1
2 ID3 ID2 ID1 NaN C1
3 ID4 ID6 ID7 ID10 C2
4 ID5 ID8 NaN NaN C3
5 ID6 ID4 ID7 ID10 C2
6 ID7 ID4 ID6 ID10 C2
7 ID8 ID5 NaN NaN C3
8 ID9 NaN NaN NaN C4
9 ID10 ID4 ID6 ID7 C2
Upvotes: 0
Views: 126
Reputation: 59519
Apply frozenset
row-wise to create distinct groups which are hashable and ordered (so which row they appear in is irrelevant). Group by these and use ngroup
to label each distinct group.
df1['Cluster'] = 'C'+ (df1.groupby(df1.apply(frozenset, 1), sort=False).ngroup()+1).astype('str')
ID CO_ID1 CO_ID2 CO_ID3 Cluster
0 ID1 ID2 ID3 NaN C1
1 ID2 ID1 ID3 NaN C1
2 ID3 ID2 ID1 NaN C1
3 ID4 ID6 ID7 ID10 C2
4 ID5 ID8 NaN NaN C3
5 ID6 ID4 ID7 ID10 C2
6 ID7 ID4 ID6 ID10 C2
7 ID8 ID5 NaN NaN C3
8 ID9 NaN NaN NaN C4
9 ID10 ID4 ID6 ID7 C2
If performance is an issue, sort with numpy
. We'll need to replace the floating NaN
with strings so all values can be compared across columns.
import numpy as np
d = pd.DataFrame(np.sort(df1.replace(np.NaN, 'NaN').values, 1), index=df1.index)
df1['Cluster'] = 'C'+(d.groupby(d.columns.tolist()).ngroup()+1).astype('str')
Upvotes: 2