Nelly Louis

Reputation: 167

How to calculate distances between rows in dataframe and create a matrix

I have a dataframe like this

import pandas as pd
sample = pd.DataFrame({'Col1': ['1','0','1','0'],'Col2':['0','0','1','1'],'Col3':['0','0','1','0'],'Class':['A','B','A','B']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
      Col1 Col2 Col3 Class
Item1    1    0    0    A
Item2    0    0    0    B
Item3    1    1    1    A
Item4    0    1    0    B

I want to calculate distances between rows, grouped by class. First, I would like the distances between the rows of class A:

       Item1   Item3
Item1  0       0.67
Item3 0.67     0

Second, the distances between the rows of class B:

       Item2   Item4
Item2  0       1
Item4  1       0

And lastly, the distances between rows of different classes:

       Item2   Item4
Item1  1       1
Item3  1       0.67

I have tried calculating distances with DistanceMetric one by one

from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('jaccard')

But I don't know how to do this for all the rows of a large dataframe and build these 3 distance matrices.

Upvotes: 0

Views: 614

Answers (1)

Mohit Motwani

Reputation: 4792

To find the distances within Class A and within Class B, you can use DataFrame.groupby (the distance used here is Euclidean):

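df = sample  # the dataframe from the question
dist = DistanceMetric.get_metric('euclidean')

def find_distance(group):
    # Pairwise distances between the feature rows of one class
    features = group[['Col1', 'Col2', 'Col3']].astype(float)
    return pd.DataFrame(dist.pairwise(features.values))

df.groupby('Class').apply(find_distance)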

            0           1
Class           
A      0    0.000000    1.414214
       1    1.414214    0.000000
B      0    0.000000    1.000000
       1    1.000000    0.000000

If you only have two classes, you can split them into two dataframes and then calculate the distances between their rows:

dist_cols = ['Col1', 'Col2','Col3']
df_a = df[df['Class']=='A']
df_b = df[df['Class']=='B']

distances = dist.pairwise(df_a[dist_cols].values, df_b[dist_cols].values)
distances
> array([[1.        , 1.41421356],
       [1.73205081, 1.41421356]])

pd.DataFrame(distances, columns = df_b.index, index = df_a.index)

          Item2       Item4
Item1   1.000000    1.414214
Item3   1.732051    1.414214
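
The same idea can be extended to the Jaccard distance the question asks for, with item labels on both axes. The following is only a sketch under a few assumptions: it uses scipy.spatial.distance.cdist instead of DistanceMetric, it converts the '0'/'1' strings to booleans first, and class_distance_matrices is a hypothetical helper name, not part of pandas or SciPy.

```python
from itertools import combinations_with_replacement

import pandas as pd
from scipy.spatial.distance import cdist

sample = pd.DataFrame(
    {'Col1': ['1', '0', '1', '0'],
     'Col2': ['0', '0', '1', '1'],
     'Col3': ['0', '0', '1', '0'],
     'Class': ['A', 'B', 'A', 'B']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])

dist_cols = ['Col1', 'Col2', 'Col3']


def class_distance_matrices(df, cols, class_col='Class', metric='jaccard'):
    """Hypothetical helper: one labelled distance matrix per pair of classes."""
    # The columns hold the strings '0'/'1'; Jaccard works on boolean vectors.
    features = df[cols].astype(int).astype(bool)
    # Map each class label to the feature rows that belong to it.
    groups = {label: features.loc[idx]
              for label, idx in df.groupby(class_col).groups.items()}
    matrices = {}
    # (A, A), (A, B), (B, B), ... covers within-class and between-class pairs.
    for a, b in combinations_with_replacement(sorted(groups), 2):
        d = cdist(groups[a].values, groups[b].values, metric=metric)
        matrices[(a, b)] = pd.DataFrame(
            d, index=groups[a].index, columns=groups[b].index)
    return matrices


matrices = class_distance_matrices(sample, dist_cols)
print(matrices[('A', 'B')].round(2))
```

The result is a dict keyed by class pairs, so matrices[('A', 'A')] and matrices[('B', 'B')] hold the within-class matrices and matrices[('A', 'B')] the between-class one. It also works with more than two classes, since every pair of class labels gets its own matrix.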

Upvotes: 1
