quant

Reputation: 4482

How to find the non one-to-one combinations in a pandas dataframe

I have the following dataframe

    bootstrap cluster_main  cluster_b  distance
1           0    Cluster 0  Cluster 1  0.002016
15          0    Cluster 0  Cluster 3  0.001282
4           0    Cluster 1  Cluster 0  0.000772
10          0    Cluster 2  Cluster 2  0.000990
26          1    Cluster 0  Cluster 2  0.001034
16          1    Cluster 2  Cluster 0  0.000159
31          1    Cluster 3  Cluster 3  0.000889
21          1    Cluster 1  Cluster 1  0.000961
35          2    Cluster 0  Cluster 3  0.099427
36          2    Cluster 1  Cluster 0  0.067036
43          2    Cluster 2  Cluster 3  0.102834
45          2    Cluster 3  Cluster 1  0.069814

I would like to find the bootstraps for which there is no one-to-one matching between cluster_main and cluster_b.

In the example above, the output should be 0 and 2: in bootstrap 2, Cluster 3 appears twice in the cluster_b column (it is "matched" twice), and in bootstrap 0, Cluster 0 appears twice in the cluster_main column.

Upvotes: 0

Views: 39

Answers (1)

jezrael

Reputation: 862691

I believe you need:

# compare sorted values
f = lambda x: sorted(x['cluster_main']) == sorted(x['cluster_b'])
# alternative: compare sets
#f = lambda x: set(x['cluster_main']) == set(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0    False
1     True
2    False
dtype: bool

out = m.index[~m]
print (out)
Int64Index([0, 2], dtype='int64', name='bootstrap')
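
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the sorted-values comparison. The DataFrame is rebuilt by hand from the question's sample (the original index labels and the distance column are left out), and the helper name `same_labels` is only illustrative. Selecting the two label columns before `apply` is a choice made here so that the grouping column is not passed to the function.

```python
import pandas as pd

# Sample data copied from the question (index labels and distance omitted).
df = pd.DataFrame({
    "bootstrap": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "cluster_main": ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 2",
                     "Cluster 0", "Cluster 2", "Cluster 3", "Cluster 1",
                     "Cluster 0", "Cluster 1", "Cluster 2", "Cluster 3"],
    "cluster_b": ["Cluster 1", "Cluster 3", "Cluster 0", "Cluster 2",
                  "Cluster 2", "Cluster 0", "Cluster 3", "Cluster 1",
                  "Cluster 3", "Cluster 0", "Cluster 3", "Cluster 1"],
})

def same_labels(group):
    # True when both columns hold the same labels with the same counts.
    return sorted(group["cluster_main"]) == sorted(group["cluster_b"])

# One boolean per bootstrap; only the two label columns reach the function.
m = df.groupby("bootstrap")[["cluster_main", "cluster_b"]].apply(same_labels)

# Bootstraps where the comparison fails, as a plain Python list.
out = m.index[~m].tolist()
print(out)
```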

EDIT: I realised the first solution was the same as comparing sets, so I removed it.

The following data shows the difference between comparing sorted values and comparing sets:

print (df)
    bootstrap cluster_main  cluster_b  distance
1           0    Cluster 0  Cluster 1  0.002016
15          0    Cluster 0  Cluster 1  0.001282
4           0    Cluster 1  Cluster 0  0.000772
10          0    Cluster 2  Cluster 2  0.000990
26          1    Cluster 2  Cluster 0  0.001034
16          1    Cluster 0  Cluster 2  0.000159
31          1    Cluster 3  Cluster 3  0.000889
21          1    Cluster 1  Cluster 1  0.000961
35          2    Cluster 0  Cluster 0  0.099427
36          2    Cluster 2  Cluster 2  0.067036
43          2    Cluster 2  Cluster 3  0.102834
45          2    Cluster 3  Cluster 2  0.069814


#compared sorted values
f = lambda x: sorted(x['cluster_main']) == sorted(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0    False
1     True
2     True
dtype: bool

# compare sets (duplicates are ignored)
f = lambda x: set(x['cluster_main']) == set(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0    True
1    True
2    True
dtype: bool

Upvotes: 1
