Create distinct cluster for non-ordered sequences in Python

Question

I have a dataframe containing "ID"s of persons in the first column. Each person can have up to 3 other persons that it is assigned into a unique group (or cluster) with. The related "CO_ID"s of a person are stored in 3 other columns. If a person is alone, i.e. there are no other persons assigned to it, it should be considered as it is in an one-person cluster anyway, the values of the other columns beeing nan. Same applies for persons who are only assigned to e.g. one other person: In this case one column contains a "CO_ID" while the other two columns beeing nan.

I wonder how I can assign those (through the CO_ID columns already excatly determined) clusters to each ID via an additional column called "CLUSTER"? Is there a prebuild function for this?

As it is obvious to see from the example data provided, the order of the "CO_ID"s is not important (for ID = ID1, it doesn't matter if CO_ID1 = ID2 and CO_ID2 = ID3 or CO_ID1 = ID3 and CO_ID2 = ID2).

The input data df1 looks like this:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID' : ['ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10'] ,
                    'CO_ID1' : ['ID2','ID1','ID2','ID6','ID8','ID4','ID4','ID5', np.nan, 'ID4'],
                    'CO_ID2' : ['ID3','ID3','ID1', 'ID7', np.nan, 'ID7','ID6',  np.nan, np.nan, 'ID6'],
                    'CO_ID3' : [np.nan, np.nan, np.nan, 'ID10', np.nan, 'ID10', 'ID10', np.nan, np.nan, 'ID7']})

Out[1]: 

  ID CO_ID1 CO_ID2 CO_ID3 
0   ID1    ID2    ID3    NaN      
1   ID2    ID1    ID3    NaN      
2   ID3    ID2    ID1    NaN      
3   ID4    ID6    ID7   ID10      
4   ID5    ID8    NaN    NaN      
5   ID6    ID4    ID7   ID10      
6   ID7    ID4    ID6   ID10      
7   ID8    ID5    NaN    NaN      
8   ID9    NaN    NaN    NaN      
9  ID10    ID4    ID6    ID7

And the desired output data df2 looks like this:

df2 = pd.DataFrame({'ID' : ['ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10'] ,
                    'CO_ID1' : ['ID2','ID1','ID2','ID6','ID8','ID4','ID4','ID5', np.nan, 'ID4'],
                    'CO_ID2' : ['ID3','ID3','ID1', 'ID7', np.nan, 'ID7','ID6',  np.nan, np.nan, 'ID6'],
                    'CO_ID3' : [np.nan, np.nan, np.nan, 'ID10', np.nan, 'ID10', 'ID10', np.nan, np.nan, 'ID7'],
                    'Cluster' : ['C1','C1','C1','C2','C3','C2','C2','C3','C4','C2']})

Out[2]:

  ID CO_ID1 CO_ID2 CO_ID3 Cluster
0   ID1    ID2    ID3    NaN      C1
1   ID2    ID1    ID3    NaN      C1
2   ID3    ID2    ID1    NaN      C1
3   ID4    ID6    ID7   ID10      C2
4   ID5    ID8    NaN    NaN      C3
5   ID6    ID4    ID7   ID10      C2
6   ID7    ID4    ID6   ID10      C2
7   ID8    ID5    NaN    NaN      C3
8   ID9    NaN    NaN    NaN      C4
9  ID10    ID4    ID6    ID7      C2

ALollz · Accepted Answer

Apply frozenset row-wise to create distinct groups which are hashable and ordered (so which row they appear in is irrelevant). Group by these and use ngroup to label each distinct group.

df1['Cluster'] = 'C'+ (df1.groupby(df1.apply(frozenset, 1), sort=False).ngroup()+1).astype('str')

Output

     ID CO_ID1 CO_ID2 CO_ID3 Cluster
0   ID1    ID2    ID3    NaN      C1
1   ID2    ID1    ID3    NaN      C1
2   ID3    ID2    ID1    NaN      C1
3   ID4    ID6    ID7   ID10      C2
4   ID5    ID8    NaN    NaN      C3
5   ID6    ID4    ID7   ID10      C2
6   ID7    ID4    ID6   ID10      C2
7   ID8    ID5    NaN    NaN      C3
8   ID9    NaN    NaN    NaN      C4
9  ID10    ID4    ID6    ID7      C2

If performance is an issue, sort with numpy. We'll need to replace the floating NaN with strings so all values can be compared across columns.

import numpy as np

d = pd.DataFrame(np.sort(df1.replace(np.NaN, 'NaN').values, 1), index=df1.index)
df1['Cluster'] = 'C'+(d.groupby(d.columns.tolist()).ngroup()+1).astype('str')

Create distinct cluster for non-ordered sequences in Python

Answers (1)

Output

Related Questions