Harsh

Reputation: 85

Jaccard Distance calculation using pdist in scipy

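import scipy.spatial.distance as dist

# Two rows of arbitrary integers (not boolean vectors)
Y = [[1, 2, 3], [2, 3, 4]]

# Pairwise Jaccard distance between the rows of Y
Q = dist.pdist(Y, 'jaccard')

print(Q)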

The snippet above gives a Jaccard distance of 1, while it should be 0.5. On the other hand, if Y=[[1,2,3],[4,2,3]], i.e. if the ordering is changed, the output is 0.33. But Jaccard distance is independent of the order of elements. Can you suggest how to resolve this issue?

Upvotes: 2

Views: 2013

Answers (2)

Tim Kirkwood

Reputation: 716

For anyone else with this issue: pdist appears to compare the arrays position by position rather than by which values are present, so the scipy implementation is order dependent. The input arrays are also not converted to boolean arrays first (that is, [1,2,3] and [4,5,6] are not both treated as [True True True]), unlike the scipy jaccard function.

I had a similar issue and looked at it here:
Why are there discrepancies when generating a distance matrix with scipy pdist(metric = 'jaccard') vs scipy jaccard?
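To illustrate the position-by-position comparison described above, here is a minimal pure-Python sketch. It is not scipy's source code: `positional_jaccard` is a hypothetical helper, and it assumes that the distance is the fraction of differing positions among the positions where at least one entry is nonzero. Under that assumption it reproduces the two values reported in the question.

```python
def positional_jaccard(u, v):
    """Hypothetical helper: share of positions that differ, counted only
    over positions where at least one of the two entries is nonzero."""
    considered = [(a, b) for a, b in zip(u, v) if a != 0 or b != 0]
    if not considered:
        # Assumption: two all-zero vectors are treated as identical
        return 0.0
    differing = sum(1 for a, b in considered if a != b)
    return differing / len(considered)

# No position matches, so all 3 considered positions differ
print(positional_jaccard([1, 2, 3], [2, 3, 4]))  # 1.0

# Only the first position differs: 1 out of 3
print(positional_jaccard([1, 2, 3], [4, 2, 3]))  # 0.3333333333333333
```

This shows why reordering the elements changes the result: the comparison is made per index, not per set membership.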

Upvotes: 0

Warren Weckesser

Reputation: 114791

The docstring for the jaccard function gives a better description of the calculation than the terse summary in the pdist docstring. jaccard computes the Jaccard-Needham dissimilarity for boolean arrays. Its behavior for other array types is not defined, so you shouldn't be passing in arrays of arbitrary integers.
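One way to get the set-based result the question expects is to encode each row as a boolean membership vector before calling pdist. The sketch below is only an illustration under that assumption: `to_indicator` is a hypothetical helper, not part of scipy, and it treats each row as a set of values.

```python
import numpy as np
from scipy.spatial.distance import pdist

def to_indicator(rows):
    """Hypothetical helper: encode each row as a boolean membership
    vector over all values that occur anywhere in the input."""
    vocabulary = sorted({value for row in rows for value in row})
    column = {value: j for j, value in enumerate(vocabulary)}
    indicator = np.zeros((len(rows), len(vocabulary)), dtype=bool)
    for i, row in enumerate(rows):
        for value in row:
            indicator[i, column[value]] = True
    return indicator

Y = [[1, 2, 3], [2, 3, 4]]

# Rows become [T, T, T, F] and [F, T, T, T] over the values 1..4
Q = pdist(to_indicator(Y), 'jaccard')
print(Q)  # [0.5]
```

Because the encoding depends only on which values are present, Y=[[1,2,3],[4,2,3]] gives the same 0.5 regardless of element order.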

Upvotes: 1
