misterte

Reputation: 997

Distance matrix for rows in pandas dataframe

I have a pandas dataframe that looks as follows:

In [23]: dataframe.head()
Out[23]: 
column_id   1  10  11  12  13  14  15  16  17  18 ...  46  47  48  49   5  50  \
row_id                                            ...                           
1         NaN NaN   1   1   1   1   1   1   1   1 ...   1   1 NaN   1 NaN NaN   
10          1   1   1   1   1   1   1   1   1 NaN ...   1   1   1 NaN   1 NaN   
100         1   1 NaN   1   1   1   1   1 NaN   1 ... NaN NaN   1   1   1 NaN   
11        NaN   1   1   1   1   1   1   1   1 NaN ... NaN   1   1   1   1   1   
12          1   1   1 NaN   1   1   1   1 NaN   1 ...   1 NaN   1   1 NaN   1   

The thing is, I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:

In [24]: dataframe.transpose().corr().head()
Out[24]: 
row_id   1  10  100  11  12  13  14  15  16  17 ...  90  91  92  93  94  95  \
row_id                                          ...                           
1      NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
10     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
100    NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
11     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
12     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN    

Is there any other way of computing correlations that avoids this? Or maybe an easy way to calculate the Euclidean distance between rows with a single method, just as the Pearson correlation has?

Thanks!

A.

Upvotes: 6

Views: 10845

Answers (4)

Michal Skop

Reputation: 1399

I compared three variants from the other answers here for speed, using a trial 1000x25 matrix (which yields a 1000x1000 result matrix):

1. dcor library

Time: 0.03s

https://dcor.readthedocs.io/en/latest/functions/dcor.distances.pairwise_distances.html

import dcor

result = dcor.distances.pairwise_distances(data)
2. scipy.spatial.distance_matrix

Time: 0.05s

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html

from scipy.spatial import distance_matrix

result = distance_matrix(data, data)
3. lambda function with numpy or pandas

Time: 180s / 90s

import numpy as np

distance = lambda x, y: np.sqrt(np.sum((x - y) ** 2))  # variant A (180s)
distance = lambda x, y: np.linalg.norm(x - y)  # variant B (90s; the original used pd.np, which no longer exists in current pandas)

result = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
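
For comparison, the same Euclidean matrix can be built with plain numpy broadcasting, which avoids the per-pair Python overhead of the nested apply; a minimal sketch (assuming the intermediate n x n x d array fits in memory):

arr = np.asarray(data)                      # shape (n, d)
diff = arr[:, None, :] - arr[None, :, :]    # shape (n, n, d), pairwise differences
result = np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n), Euclidean distance matrix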

Upvotes: 0

MyCarta

Reputation: 818

This is my numpy-only version of @S Anand's fantastic answer, which I put together in order to help myself understand his explanation better.

Happy to share it with a short, reproducible example:

# Preliminaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

Let's try scipy.stats.pearsonr first.

Executing:

from scipy.stats import pearsonr

distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt

returns: [image: the 5x5 Pearson correlation matrix as a DataFrame; same values as the array below]

and:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0], 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, -0.12, 0.87, 0.82, 0.78],
       [-0.12, 1.00, -0.43, -0.37, -0.43],
       [0.87, -0.43, 1.00, 0.96, 0.95],
       [0.82, -0.37, 0.96, 1.00, 0.96],
       [0.78, -0.43, 0.95, 0.96, 1.00]])
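
Note that, for the Pearson case at least, pandas also produces this matrix directly, without the nested apply; the built-in should match the result above:

# DataFrame.corr() defaults to pairwise Pearson correlation of the columns
rslt_builtin = iris_df.corr()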

As a second example let's try the distance correlation from the dcor library.

Executing:

import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt 

returns: [image: the 5x5 distance correlation matrix as a DataFrame; same values as the array below]

while:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2), 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, 0.31, 0.86, 0.83, 0.78],
       [0.31, 1.00, 0.54, 0.51, 0.51],
       [0.86, 0.54, 1.00, 0.97, 0.95],
       [0.83, 0.51, 0.97, 1.00, 0.95],
       [0.78, 0.51, 0.95, 0.95, 1.00]])

Upvotes: 0

maparent

Reputation: 41

A proposal to improve the excellent answer from @s-anand for Euclidean distance: instead of

zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)

we can apply fillna to the pairwise difference instead, so that only the missing entries are filled:

distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))

This way, the distance on missing dimensions will not be counted.
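
For completeness, a minimal sketch of this variant plugged into the nested-apply pattern from @s-anand's answer (the toy data below is an assumption, built the same way as in that answer):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((10, 5)))
data[data > 0.2] = 1
data[data <= 0.2] = np.nan

# zero only the missing entries of each pairwise difference,
# so absent dimensions contribute nothing to the distance
distance = lambda c1, c2: np.linalg.norm((c1 - c2).fillna(0))
result = data.apply(lambda col1: data.apply(lambda col2: distance(col1, col2)))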

Upvotes: 2

S Anand

Reputation: 11938

The key question here is what distance metric to use.

Let's say this is your data.

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
   0   1   2   3   4   5   6   7   8   9  ...  40  41  42  43  44  45  46  47  \
0   1   1   1 NaN   1 NaN NaN   1   1   1 ...   1   1 NaN   1 NaN   1   1   1
1   1   1   1 NaN   1   1   1   1   1   1 ... NaN   1   1 NaN NaN   1   1   1
2   1   1   1   1   1   1   1   1   1   1 ...   1 NaN   1   1   1   1   1 NaN
3   1 NaN   1 NaN   1 NaN   1 NaN   1   1 ...   1   1   1   1 NaN   1   1   1
4   1   1   1   1   1   1   1   1 NaN   1 ... NaN   1   1   1   1   1   1   1

What is the % difference?

You can compute a distance metric as the percentage of values that differ between each pair of columns. The result shows the % difference between any two columns.

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
     0     1     2     3     4     5     6     7     8     9   ...     40  \
0  0.00  0.36  0.33  0.37  0.32  0.41  0.35  0.33  0.39  0.33  ...   0.37
1  0.36  0.00  0.37  0.29  0.30  0.37  0.33  0.37  0.33  0.31  ...   0.35
2  0.33  0.37  0.00  0.36  0.29  0.38  0.40  0.34  0.30  0.28  ...   0.28
3  0.37  0.29  0.36  0.00  0.29  0.30  0.34  0.26  0.32  0.36  ...   0.36
4  0.32  0.30  0.29  0.29  0.00  0.31  0.35  0.29  0.29  0.25  ...   0.27
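
Incidentally, on 0/1 data this % difference is exactly the Hamming distance, so the same matrix can be obtained from scipy without the nested apply; a sketch, assuming the zero_data frame above:

>>> from scipy.spatial.distance import pdist, squareform
>>> # pdist works on rows, so transpose to compare columns;
>>> # 'hamming' is the fraction of positions that differ
>>> pct = pd.DataFrame(squareform(pdist(zero_data.T, metric='hamming')),
...                    index=zero_data.columns, columns=zero_data.columns)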

What is the correlation coefficient?

Here, we use the Pearson correlation coefficient. This is a perfectly valid metric; specifically, for binary data it reduces to the phi coefficient.

>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  1.000000  0.013158  0.026262 -0.059786 -0.024293 -0.078056  0.054074
1  0.013158  1.000000 -0.093109  0.170159  0.043187  0.027425  0.108148
2  0.026262 -0.093109  1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786  0.170159 -0.124540  1.000000  0.004245  0.184153  0.042524
4 -0.024293  0.043187 -0.048485  0.004245  1.000000  0.079196 -0.099834

Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
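
A quick way to check that claim: ranking two-valued data is a positive affine transformation, and Pearson correlation is invariant under such transforms, so the two matrices should agree to floating-point precision.

>>> spearman = lambda c1, c2: scipy.stats.spearmanr(c1, c2)[0]
>>> result_sp = zero_data.apply(lambda col1: zero_data.apply(lambda col2: spearman(col1, col2)))
>>> np.allclose(result, result_sp)  # expect True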

What is the Euclidean distance?

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  0.000000  6.000000  5.744563  6.082763  5.656854  6.403124  5.916080
1  6.000000  0.000000  6.082763  5.385165  5.477226  6.082763  5.744563
2  5.744563  6.082763  0.000000  6.000000  5.385165  6.164414  6.324555
3  6.082763  5.385165  6.000000  0.000000  5.385165  5.477226  5.830952
4  5.656854  5.477226  5.385165  5.385165  0.000000  5.567764  5.916080

By now, you'd have a sense of the pattern. Create a distance method. Then apply it pairwise to every column using

data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

If your distance method relies on the presence of zeroes instead of nans, convert to zeroes using .fillna(0).
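
Wrapped up as a small helper, the whole pattern looks like this (a sketch; pairwise_distance is a hypothetical name, not a pandas API):

>>> def pairwise_distance(df, method, fill_value=None):
...     # apply `method` to every pair of columns; fill_value replaces NaNs first
...     if fill_value is not None:
...         df = df.fillna(fill_value)
...     return df.apply(lambda col1: df.apply(lambda col2: method(col1, col2)))
...
>>> euclidean = pairwise_distance(data, lambda a, b: np.linalg.norm(a - b), fill_value=0)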

Upvotes: 15
