Reputation: 997
I have a pandas dataframe that looks as follows:
In [23]: dataframe.head()
Out[23]:
column_id 1 10 11 12 13 14 15 16 17 18 ... 46 47 48 49 5 50 \
row_id ...
1 NaN NaN 1 1 1 1 1 1 1 1 ... 1 1 NaN 1 NaN NaN
10 1 1 1 1 1 1 1 1 1 NaN ... 1 1 1 NaN 1 NaN
100 1 1 NaN 1 1 1 1 1 NaN 1 ... NaN NaN 1 1 1 NaN
11 NaN 1 1 1 1 1 1 1 1 NaN ... NaN 1 1 1 1 1
12 1 1 1 NaN 1 1 1 1 NaN 1 ... 1 NaN 1 1 NaN 1
The thing is I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, sometimes the standard deviation is zero (all values are 1 or NaN), so the Pearson correlation returns this:
In [24]: dataframe.transpose().corr().head()
Out[24]:
row_id 1 10 100 11 12 13 14 15 16 17 ... 90 91 92 93 94 95 \
row_id ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
Is there any other way of computing correlations that avoids this? Maybe an easy way to calculate the Euclidean distance between rows with just one method, just as Pearson correlation does?
Thanks!
A.
Upvotes: 6
Views: 10845
Reputation: 1399
I compared three variants from the other answers here for speed, using a trial 1000x25 matrix (which yields a 1000x1000 result matrix).
dcor library
Time: 0.03s
https://dcor.readthedocs.io/en/latest/functions/dcor.distances.pairwise_distances.html
import dcor
result = dcor.distances.pairwise_distances(data)
scipy.spatial.distance_matrix
Time: 0.05s
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
from scipy.spatial import distance_matrix
result = distance_matrix(data, data)
lambda function with numpy or pandas
Time: 180s / 90s
import numpy as np
distance = lambda x, y: np.sqrt(np.sum((x - y) ** 2))  # variant A (180s)
distance = lambda x, y: np.linalg.norm(x - y)          # variant B (90s; originally written as pd.np.linalg.norm, but pd.np was removed in pandas 2.0)
result = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
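For reference, a minimal sketch of a timing harness along these lines (the 1000x25 random matrix mirrors the trial described above; absolute timings will of course vary by machine):
import time
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Trial data: 1000 rows x 25 columns, as in the comparison above
data = pd.DataFrame(np.random.rand(1000, 25))

# Vectorized scipy variant
start = time.perf_counter()
result_fast = distance_matrix(data, data)
print('distance_matrix:', time.perf_counter() - start, 's')

# Nested-apply variant (expect minutes, per the numbers above)
distance = lambda x, y: np.linalg.norm(x - y)
start = time.perf_counter()
result_slow = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
print('nested apply:', time.perf_counter() - start, 's')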
Upvotes: 0
Reputation: 818
This is my numpy-only version of @S Anand's fantastic answer, which I put together in order to help myself understand his explanation better.
Happy to share it with a short, reproducible example:
# Preliminaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
Let's try scipy.stats.pearsonr first.
Executing:
from scipy.stats import pearsonr
distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
and:
rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0],
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
both return:
array([[1.00, -0.12, 0.87, 0.82, 0.78],
[-0.12, 1.00, -0.43, -0.37, -0.43],
[0.87, -0.43, 1.00, 0.96, 0.95],
[0.82, -0.37, 0.96, 1.00, 0.96],
[0.78, -0.43, 0.95, 0.96, 1.00]])
As a second example, let's try the distance correlation from the dcor library.
Executing:
import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
and:
rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2),
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
both return:
array([[1.00, 0.31, 0.86, 0.83, 0.78],
[0.31, 1.00, 0.54, 0.51, 0.51],
[0.86, 0.54, 1.00, 0.97, 0.95],
[0.83, 0.51, 0.97, 1.00, 0.95],
[0.78, 0.51, 0.95, 0.95, 1.00]])
Upvotes: 0
Reputation: 41
A proposal to improve the excellent answer from @s-anand for Euclidean distance: instead of
zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)
we can apply the fillna after the subtraction, so that only the missing differences are zeroed out:
distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))
This way, the distance on missing dimensions will not be counted.
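A minimal sketch contrasting the two variants on a hypothetical two-column frame with missing values:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1.0, 1.0, np.nan],
                     'b': [1.0, np.nan, 1.0]})

# Filling first: every missing value contributes |1 - 0| = 1 to the distance
zero_data = data.fillna(0)
print(np.linalg.norm(zero_data['a'] - zero_data['b']))    # 1.414...

# Subtracting first, then filling: missing dimensions contribute nothing
print(np.linalg.norm((data['a'] - data['b']).fillna(0)))  # 0.0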
Upvotes: 2
Reputation: 11938
The key question here is what distance metric to use.
Let's say this is your data.
>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 \
0 1 1 1 NaN 1 NaN NaN 1 1 1 ... 1 1 NaN 1 NaN 1 1 1
1 1 1 1 NaN 1 1 1 1 1 1 ... NaN 1 1 NaN NaN 1 1 1
2 1 1 1 1 1 1 1 1 1 1 ... 1 NaN 1 1 1 1 1 NaN
3 1 NaN 1 NaN 1 NaN 1 NaN 1 1 ... 1 1 1 1 NaN 1 1 1
4 1 1 1 1 1 1 1 1 NaN 1 ... NaN 1 1 1 1 1 1 1
You can compute a distance metric as the percentage of values that differ between each pair of columns. The result shows the % difference between any 2 columns.
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 7 8 9 ... 40 \
0 0.00 0.36 0.33 0.37 0.32 0.41 0.35 0.33 0.39 0.33 ... 0.37
1 0.36 0.00 0.37 0.29 0.30 0.37 0.33 0.37 0.33 0.31 ... 0.35
2 0.33 0.37 0.00 0.36 0.29 0.38 0.40 0.34 0.30 0.28 ... 0.28
3 0.37 0.29 0.36 0.00 0.29 0.30 0.34 0.26 0.32 0.36 ... 0.36
4 0.32 0.30 0.29 0.29 0.00 0.31 0.35 0.29 0.29 0.25 ... 0.27
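Incidentally, on the zero-filled 0/1 data this is exactly the Hamming distance between columns, so a vectorized sketch using scipy (an alternative I'm adding here, not part of the original approach) would be:
>>> from scipy.spatial.distance import pdist, squareform
>>> result = pd.DataFrame(squareform(pdist(zero_data.T, metric='hamming')),
...                       index=zero_data.columns, columns=zero_data.columns)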
Alternatively, here we use the Pearson correlation coefficient. This is a perfectly valid metric. Specifically, it translates to the phi coefficient in the case of binary data.
>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 1.000000 0.013158 0.026262 -0.059786 -0.024293 -0.078056 0.054074
1 0.013158 1.000000 -0.093109 0.170159 0.043187 0.027425 0.108148
2 0.026262 -0.093109 1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786 0.170159 -0.124540 1.000000 0.004245 0.184153 0.042524
4 -0.024293 0.043187 -0.048485 0.004245 1.000000 0.079196 -0.099834
Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
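A quick sketch to check that claim: on 0/1 data the ranks are a monotone transform of the values, so Spearman reduces to Pearson here.
>>> r_p = scipy.stats.pearsonr(zero_data[0], zero_data[1])[0]
>>> r_s = scipy.stats.spearmanr(zero_data[0], zero_data[1])[0]
>>> np.isclose(r_p, r_s)
True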
And here's the Euclidean distance between columns:
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 0.000000 6.000000 5.744563 6.082763 5.656854 6.403124 5.916080
1 6.000000 0.000000 6.082763 5.385165 5.477226 6.082763 5.744563
2 5.744563 6.082763 0.000000 6.000000 5.385165 6.164414 6.324555
3 6.082763 5.385165 6.000000 0.000000 5.385165 5.477226 5.830952
4 5.656854 5.477226 5.385165 5.385165 0.000000 5.567764 5.916080
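For larger frames, the same Euclidean matrix can be computed in one vectorized call; a sketch using scipy's cdist (again an alternative, not part of the original answer):
>>> from scipy.spatial.distance import cdist
>>> result = pd.DataFrame(cdist(zero_data.T, zero_data.T),
...                       index=zero_data.columns, columns=zero_data.columns)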
By now, you'd have a sense of the pattern: create a distance method, then apply it pairwise to every column using
data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))
If your distance method relies on the presence of zeroes instead of NaNs, convert to zeroes using .fillna(0).
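Wrapped up as a small helper, a sketch of that pattern (the name pairwise_apply is mine, not a pandas API):
import numpy as np

def pairwise_apply(data, method, fill_value=None):
    # Apply `method` to every pair of columns; returns a square DataFrame.
    if fill_value is not None:
        data = data.fillna(fill_value)
    return data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

# e.g. the Euclidean distance example above:
result = pairwise_apply(data, lambda c1, c2: np.linalg.norm(c1 - c2), fill_value=0)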
Upvotes: 15