Reputation: 167
I have a dataframe like this
import pandas as pd
sample = pd.DataFrame({'Col1': ['1','0','1','0'],'Col2':['0','0','1','1'],'Col3':['0','0','1','0'],'Class':['A','B','A','B']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
      Col1 Col2 Col3 Class
Item1    1    0    0     A
Item2    0    0    0     B
Item3    1    1    1     A
Item4    0    1    0     B
I want to calculate pairwise row distances within each class and between the classes. First of all, I would like to calculate the distances between the rows of class A:
       Item1  Item3
Item1   0      0.67
Item3   0.67   0
Secondly, distances between rows from class B
       Item2  Item4
Item2   0      1
Item4   1      0
And lastly, the distances between rows of different classes:
       Item2  Item4
Item1   1      1
Item3   1      0.67
I have tried calculating distances with DistanceMetric one by one
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('jaccard')
But I don't know how to do this by iterating over the rows of a large dataframe to create these three distance matrices.
Upvotes: 0
Views: 614
Reputation: 4792
To find the distances within Class A and Class B, you can use DataFrame.groupby (the distance used here is Euclidean, and df is the dataframe from the question):
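dist = DistanceMetric.get_metric('euclidean')
dist_cols = ['Col1', 'Col2', 'Col3']
def find_distance(group):
    # Use only the feature columns; they hold strings, so cast to float first
    return pd.DataFrame(dist.pairwise(group[dist_cols].astype(float).values))
df.groupby('Class').apply(find_distance)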
                0         1
Class
A     0  0.000000  1.414214
      1  1.414214  0.000000
B     0  0.000000  1.000000
      1  1.000000  0.000000
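The question asks for Jaccard distances and labelled matrices, whereas the result above is Euclidean and indexed by position. The following is a minimal sketch of one way to get that, not the approach used in this answer: it swaps in scipy's pdist and squareform, and the helper name within_class_jaccard is hypothetical. It assumes the feature columns contain only the strings '0' and '1'.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Sample data copied from the question.
sample = pd.DataFrame(
    {'Col1': ['1', '0', '1', '0'],
     'Col2': ['0', '0', '1', '1'],
     'Col3': ['0', '0', '1', '0'],
     'Class': ['A', 'B', 'A', 'B']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])

dist_cols = ['Col1', 'Col2', 'Col3']


def within_class_jaccard(frame):
    """Hypothetical helper: map each class label to a square DataFrame
    of Jaccard distances between the rows of that class."""
    result = {}
    for label, group in frame.groupby('Class'):
        # Assumption: the columns hold '0'/'1' strings, so cast to booleans.
        bits = group[dist_cols].astype(int).astype(bool).to_numpy()
        # pdist returns the condensed distances; squareform expands them.
        square = squareform(pdist(bits, metric='jaccard'))
        result[label] = pd.DataFrame(square, index=group.index,
                                     columns=group.index)
    return result


within = within_class_jaccard(sample)
print(within['A'].round(2))
print(within['B'].round(2))
```

With the sample data this should give 0.67 between Item1 and Item3 and 1 between Item2 and Item4, matching the matrices in the question. Because it loops over groupby, it works for any number of classes.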
If you only have two classes, you can split them into two dataframes and then calculate the distances between their rows:
dist_cols = ['Col1', 'Col2','Col3']
df_a = df[df['Class']=='A']
df_b = df[df['Class']=='B']
distances = dist.pairwise(df_a[dist_cols].values, df_b[dist_cols].values)
distances
> array([[1.        , 1.41421356],
         [1.73205081, 1.41421356]])
pd.DataFrame(distances, columns = df_b.index, index = df_a.index)
          Item2     Item4
Item1  1.000000  1.414214
Item3  1.732051  1.414214
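If there are more than two classes, or if the Jaccard distance from the question is wanted instead of Euclidean, the same idea can be applied to every pair of classes. This is only a sketch under assumptions: it uses scipy's cdist rather than DistanceMetric, the helper name between_class_jaccard is hypothetical, and the feature columns are assumed to contain only '0' and '1' strings.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question.
sample = pd.DataFrame(
    {'Col1': ['1', '0', '1', '0'],
     'Col2': ['0', '0', '1', '1'],
     'Col3': ['0', '0', '1', '0'],
     'Class': ['A', 'B', 'A', 'B']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])

dist_cols = ['Col1', 'Col2', 'Col3']


def between_class_jaccard(frame):
    """Hypothetical helper: map each pair of class labels (a, b) to a
    DataFrame of Jaccard distances, rows from class a, columns from b."""
    # Assumption: the columns hold '0'/'1' strings, so cast to booleans.
    groups = {label: g[dist_cols].astype(int).astype(bool)
              for label, g in frame.groupby('Class')}
    result = {}
    for a, b in combinations(sorted(groups), 2):
        d = cdist(groups[a].to_numpy(), groups[b].to_numpy(),
                  metric='jaccard')
        result[(a, b)] = pd.DataFrame(d, index=groups[a].index,
                                      columns=groups[b].index)
    return result


between = between_class_jaccard(sample)
print(between[('A', 'B')].round(2))
```

For the sample data the ('A', 'B') entry should reproduce the third matrix in the question: 1 everywhere except 0.67 between Item3 and Item4. For a large dataframe, cdist computes each block in one vectorised call, so no explicit row-by-row loop is needed.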
Upvotes: 1