tim_xyz

Reputation: 13551

What does sklearn's pairwise_distances with metric='correlation' do?

I've put different values into this function and observed the output, but I can't find a predictable pattern in what is being output.

Then I tried digging through the function itself, but it's confusing because it can do a number of different calculations.

According to the Docs:

Compute the distance matrix from a vector array X and optional Y.

I see it returns a matrix of height and width equal to the number of nested lists inputted, implying that it compares each list with every other one.

But otherwise I'm having a tough time understanding what it's doing and where the values are coming from.

Examples I've tried:

pairwise_distances([[1]], metric='correlation')
>>> array([[0.]])

pairwise_distances([[1], [1]], metric='correlation')
>>> array([[ 0., nan],
>>>        [nan,  0.]])

# returns same as last input although input values differ
pairwise_distances([[1], [2]], metric='correlation')
>>> array([[ 0., nan],
>>>        [nan,  0.]])

pairwise_distances([[1,2], [1,2]], metric='correlation')
>>> array([[0.00000000e+00, 2.22044605e-16],
>>>        [2.22044605e-16, 0.00000000e+00]])

# returns same as last input although input values differ
# I incorrectly expected more distance because input values differ more
pairwise_distances([[1,2], [1,3]], metric='correlation')
>>> array([[0.00000000e+00, 2.22044605e-16],
>>>        [2.22044605e-16, 0.00000000e+00]])

Computing correlation distance with Scipy

I don't understand where the sklearn 2.22044605e-16 value is coming from if scipy returns 0.0 for the same inputs.

# Scipy
import scipy.spatial.distance
scipy.spatial.distance.correlation([1,2], [1,2])
>>> 0.0

# Sklearn
pairwise_distances([[1,2], [1,2]], metric='correlation')
>>> array([[0.00000000e+00, 2.22044605e-16],
>>>        [2.22044605e-16, 0.00000000e+00]])

I'm not looking for a high level explanation but an example of how the numbers are calculated.

Upvotes: 3

Views: 4065

Answers (4)

Uri Goren

Reputation: 13700

I totally understand the confusion.

Correlation is calculated on vectors, and sklearn performed a non-trivial conversion of a scalar to a vector of size 1.

The result of

from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import correlation
pairwise_distances([u,v,w], metric='correlation')

is a matrix M of shape (len([u,v,w]), len([u,v,w])) = (3, 3), where:

M[0,0] = correlation(u,u)
M[0,1] = correlation(u,v)
M[0,2] = correlation(u,w)
M[1,0] = correlation(v,u)
M[1,1] = correlation(v,v)
M[1,2] = correlation(v,w)
M[2,0] = correlation(w,u)
M[2,1] = correlation(w,v)
M[2,2] = correlation(w,w)

You were looking at correlation([u,v,w], [u,v,w]), which has a valid value only if u, v and w are scalars.
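One way to check the entry-by-entry description above is to rebuild the matrix with scipy's vector-level function and compare. This is only a sketch: the vectors u, v and w are made-up example data, and the comparison uses a tolerance because the two code paths can differ by round-off.

```python
import numpy as np
from scipy.spatial.distance import correlation
from sklearn.metrics import pairwise_distances

# Hypothetical example vectors (not from the question); each has non-zero variance.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 2.0, 4.0, 4.0]
w = [4.0, 3.0, 2.0, 1.0]
X = np.array([u, v, w])

# Full pairwise matrix from sklearn.
M = pairwise_distances(X, metric='correlation')

# Same matrix rebuilt one entry at a time: expected[i, j] = correlation(X[i], X[j]).
expected = np.array([[correlation(a, b) for b in X] for a in X])

# Compare with a tolerance rather than ==, since tiny round-off differences are possible.
matches = np.allclose(M, expected)
print(matches)
```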

Upvotes: 1

Venkatachalam

Reputation: 16966

pairwise_distances internally calls distance.pdist() when Y is None (which means we want to compute the distance between every pair of vectors in X).

Reference 1, 2

The implementation would be similar to the following:

import numpy as np
from numpy.linalg import norm

X = np.array([[1,2], [1,2]])

# center each row on its own mean
X2 = X - X.mean(axis=1, keepdims=True)

u, v = X2

# 1 - cosine similarity of the centered rows
1 - (sum(u*v)/(norm(u) * norm(v)))

#2.220446049250313e-16

But the scipy.spatial.distance.correlation implementation differs in the latest version:

latest version, old version

If we set the weights to None, the following snippet is the simplified version of it:

u, v = np.array([1,2]), np.array([1,2])

umu = np.average(u)
vmu = np.average(v)
u = u - umu
v = v - vmu
uv = np.average(u * v)
uu = np.average(np.square(u))
vv = np.average(np.square(v))
dist = 1.0 - uv / np.sqrt(uu * vv)
dist

#0.0
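To see where the two results part ways, the sketch below evaluates both forms on the same centered vectors and keeps the intermediate values. It assumes NumPy only; the variable names are illustrative and not taken from either library's source.

```python
import numpy as np

# Centered version of [1, 2]: the mean is 1.5, so this is [-0.5, 0.5].
u = np.array([1.0, 2.0]) - 1.5
v = u.copy()

# Norm-based form: sqrt(0.5) is not exactly representable in binary,
# so the product of the two norms need not come back to exactly 0.5.
dot = np.dot(u, v)
norm_product = np.linalg.norm(u) * np.linalg.norm(v)
dist_norm = 1.0 - dot / norm_product

# Mean-based form: the intermediates 0.25 and 0.0625 are exact in binary,
# so the ratio is exactly 1 and the distance is exactly 0.
uv = np.mean(u * v)
uu = np.mean(u * u)
vv = np.mean(v * v)
dist_mean = 1.0 - uv / np.sqrt(uu * vv)

print(norm_product, dist_norm, dist_mean)
```

Both forms are algebraically the same expression; they differ only in the order of the floating-point operations, which is enough to leave a residue on the order of machine epsilon in one of them.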

Upvotes: 3

Nirmal

Reputation: 1435

import sklearn.metrics

X = [[1, 2, 3, 4], [2, 2, 4, 4], [4, 3, 2, 1]]

D = sklearn.metrics.pairwise_distances(X, metric='correlation')
print(D)

Output:

[[0.         0.10557281 2.        ]
 [0.10557281 0.         1.89442719]
 [2.         1.89442719 0.        ]]

D is a distance matrix such that D[i, j] is the distance between the ith and jth vectors of the given matrix X.

import scipy.spatial.distance

X = [[1, 2, 3, 4], [2, 2, 4, 4], [4, 3, 2, 1]]

c_00 = scipy.spatial.distance.correlation(X[0], X[0])        # c_00 = 0.0
c_01 = scipy.spatial.distance.correlation(X[0], X[1])        # c_01 = 0.10557280900008414
c_02 = scipy.spatial.distance.correlation(X[0], X[2])        # c_02 = 2.0

"I don't understand where the sklearn 2.22044605e-16 value is coming from if scipy returns 0.0 for the same inputs."

This is probably a round-off error.

import numpy as np
epsilon = np.finfo(float).eps
print(epsilon)

Outputs:

2.220446049250313e-16                                    # This value is machine dependent

np.isclose does not round anything itself, but you could use it to detect values that are extremely close to 0 and then treat them as 0.
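A minimal sketch of that idea, assuming the default tolerances of np.isclose are acceptable for your data. The matrix is copied from the output in the question.

```python
import numpy as np

# Distance matrix copied from the question's output.
D = np.array([[0.00000000e+00, 2.22044605e-16],
              [2.22044605e-16, 0.00000000e+00]])

# np.isclose returns a boolean mask; it does not modify D.
near_zero = np.isclose(D, 0.0)

# Use the mask to snap round-off noise to exactly 0 and keep other entries as they are.
D_clean = np.where(near_zero, 0.0, D)
print(D_clean)
```

The default absolute tolerance is 1e-08, so genuinely small but meaningful distances below that threshold would also be set to 0. Pass a smaller atol if that matters.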

Upvotes: 1

bart cubrich

Reputation: 1254

The distance metrics can be found here: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

And correlation is specifically here:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.correlation.html#scipy.spatial.distance.correlation

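The correlation distance between u and v is defined as

$$d(u, v) = 1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\lVert u - \bar{u} \rVert_2 \, \lVert v - \bar{v} \rVert_2}$$

where $\bar{u}$ and $\bar{v}$ are the means of the elements of u and v.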
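As a worked example, here is a sketch of that definition written out in NumPy, assuming the usual form of one minus the Pearson correlation of u and v. The helper correlation_distance is illustrative only and is not the scipy or sklearn implementation. It also shows the two cases from the question: a vector of length 1 centers to all zeros, so the ratio is 0/0 and the result is nan, and any two vectors of length 2 with distinct elements center to (-a, a) shapes, so their correlation is always +1 or -1 regardless of how far apart the values are.

```python
import numpy as np

def correlation_distance(u, v):
    """1 - Pearson correlation, written out step by step (illustrative helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc = u - u.mean()          # center each vector on its own mean
    vc = v - v.mean()
    denom = np.linalg.norm(uc) * np.linalg.norm(vc)
    # Silence the 0/0 warning; the nan result is what we want to observe.
    with np.errstate(invalid='ignore', divide='ignore'):
        return 1.0 - np.dot(uc, vc) / denom

# Four elements: centered vectors are [-1.5, -0.5, 0.5, 1.5] and [-1, -1, 1, 1],
# so the distance is 1 - 4 / (sqrt(5) * 2), roughly 0.1056.
d_long = correlation_distance([1, 2, 3, 4], [2, 2, 4, 4])

# Two elements: both vectors increase, so the correlation is +1 and the
# distance is 0 up to round-off, even though the values differ.
d_pair = correlation_distance([1, 2], [1, 3])

# One element: both centered vectors are [0], giving 0/0 = nan.
d_single = correlation_distance([1], [2])

print(d_long, d_pair, d_single)
```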

Upvotes: 1
