Reputation: 4482
I have the following dataframe:
    bootstrap cluster_main  cluster_b  distance
1           0    Cluster 0  Cluster 1  0.002016
15          0    Cluster 0  Cluster 3  0.001282
4           0    Cluster 1  Cluster 0  0.000772
10          0    Cluster 2  Cluster 2  0.000990
26          1    Cluster 0  Cluster 2  0.001034
16          1    Cluster 2  Cluster 0  0.000159
31          1    Cluster 3  Cluster 3  0.000889
21          1    Cluster 1  Cluster 1  0.000961
35          2    Cluster 0  Cluster 3  0.099427
36          2    Cluster 1  Cluster 0  0.067036
43          2    Cluster 2  Cluster 3  0.102834
45          2    Cluster 3  Cluster 1  0.069814
I would like to find the bootstraps for which there is no one-to-one matching between cluster_main and cluster_b.
In the above example, the output should be 2 and 0, because Cluster 3 in the cluster_b column for bootstrap 2 is "matched" twice, and the same happens for Cluster 0 in the cluster_main column for bootstrap 0.
Upvotes: 0
Views: 39
Reputation: 862691
I believe you need:
# compare sorted values
f = lambda x: sorted(x['cluster_main']) == sorted(x['cluster_b'])
# compare sets (ignores duplicates)
#f = lambda x: set(x['cluster_main']) == set(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0 False
1 True
2 False
dtype: bool
out = m.index[~m]
print (out)
Int64Index([0, 2], dtype='int64', name='bootstrap')
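For anyone who wants to run this end to end, here is a minimal self-contained sketch of the sorted-values check on the data from the question. The `distance` column is left out because the check does not use it, and the helper name `is_matched` is only illustrative. Selecting the two cluster columns before `apply` is a choice made here so that the grouping column is not passed to the function.

```python
import pandas as pd

# Data copied from the question (original row labels kept as the index;
# the distance column is omitted because the check does not need it).
df = pd.DataFrame(
    {
        "bootstrap": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
        "cluster_main": ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 2",
                         "Cluster 0", "Cluster 2", "Cluster 3", "Cluster 1",
                         "Cluster 0", "Cluster 1", "Cluster 2", "Cluster 3"],
        "cluster_b": ["Cluster 1", "Cluster 3", "Cluster 0", "Cluster 2",
                      "Cluster 2", "Cluster 0", "Cluster 3", "Cluster 1",
                      "Cluster 3", "Cluster 0", "Cluster 3", "Cluster 1"],
    },
    index=[1, 15, 4, 10, 26, 16, 31, 21, 35, 36, 43, 45],
)


def is_matched(group):
    """True when both columns hold the same labels with the same counts."""
    return sorted(group["cluster_main"]) == sorted(group["cluster_b"])


# One boolean per bootstrap
m = df.groupby("bootstrap")[["cluster_main", "cluster_b"]].apply(is_matched)

# Bootstraps where the check fails
out = m.index[~m].tolist()
print(out)  # [0, 2]
```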
EDIT: I realised my first solution was the same as comparing sets, so I removed it.
The difference between comparing sorted values and comparing sets can be seen here:
print (df)
    bootstrap cluster_main  cluster_b  distance
1           0    Cluster 0  Cluster 1  0.002016
15          0    Cluster 0  Cluster 1  0.001282
4           0    Cluster 1  Cluster 0  0.000772
10          0    Cluster 2  Cluster 2  0.000990
26          1    Cluster 2  Cluster 0  0.001034
16          1    Cluster 0  Cluster 2  0.000159
31          1    Cluster 3  Cluster 3  0.000889
21          1    Cluster 1  Cluster 1  0.000961
35          2    Cluster 0  Cluster 0  0.099427
36          2    Cluster 2  Cluster 2  0.067036
43          2    Cluster 2  Cluster 3  0.102834
45          2    Cluster 3  Cluster 2  0.069814
# compare sorted values
f = lambda x: sorted(x['cluster_main']) == sorted(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0 False
1 True
2 True
dtype: bool
# compare sets
f = lambda x: set(x['cluster_main']) == set(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0 True
1 True
2 True
dtype: bool
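Note that in this second frame the sorted-values check returns True for bootstrap 2 even though Cluster 2 appears twice in both columns. If "one-to-one" is meant as "no label repeated in either column within a bootstrap" (an assumption about the intent of the question, not something it states outright), a stricter check could be sketched with `DataFrame.duplicated`. The names `df2`, `dup_main`, `dup_b` and `not_one_to_one` are only illustrative.

```python
import pandas as pd

# Second example frame from this answer (distance omitted, not needed here).
df2 = pd.DataFrame(
    {
        "bootstrap": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
        "cluster_main": ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 2",
                         "Cluster 2", "Cluster 0", "Cluster 3", "Cluster 1",
                         "Cluster 0", "Cluster 2", "Cluster 2", "Cluster 3"],
        "cluster_b": ["Cluster 1", "Cluster 1", "Cluster 0", "Cluster 2",
                      "Cluster 0", "Cluster 2", "Cluster 3", "Cluster 1",
                      "Cluster 0", "Cluster 2", "Cluster 3", "Cluster 2"],
    }
)

# A row is flagged when its label already occurred earlier
# in the same bootstrap, checked separately for each column.
dup_main = df2.duplicated(["bootstrap", "cluster_main"])
dup_b = df2.duplicated(["bootstrap", "cluster_b"])

# Bootstraps with at least one repeated label in either column
not_one_to_one = df2.loc[dup_main | dup_b, "bootstrap"].unique().tolist()
print(not_one_to_one)  # [0, 2]
```

On the data from the question this stricter check also flags bootstraps 0 and 2, so the two approaches only differ when a label is repeated the same number of times in both columns.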
Upvotes: 1