Reputation: 85
import scipy.spatial.distance as dist

# Two rows of arbitrary integers, intended as the sets {1,2,3} and {2,3,4}
Y = [[1, 2, 3], [2, 3, 4]]
Q = dist.pdist(Y, 'jaccard')
print(Q)
This snippet gives a Jaccard distance of 1, while it should be 0.5.
On the other hand, if Y=[[1,2,3],[4,2,3]], i.e. if the ordering is changed, the output is 0.33. But the Jaccard distance is independent of the order of the elements. Can you suggest how to resolve this issue?
Upvotes: 2
Views: 2013
Reputation: 716
For anyone else with this issue, pdist appears to compare arrays by index rather than just by which objects are present, so the scipy implementation is order dependent, but the input arrays are not treated as boolean arrays (in the sense that [1,2,3] and [4,5,6] are not both treated as [True True True], unlike the scipy jaccard function).
I had a similar issue and looked at it here:
Why are there discrepancies when generating a distance matrix with scipy pdist(metric = 'jaccard') vs scipy jaccard?
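To make the difference concrete, here is a minimal pure-Python sketch that contrasts a position-wise calculation of the kind described above with a set-based Jaccard distance. The function names positionwise_jaccard and set_jaccard are hypothetical, and the position-wise version only approximates the behaviour reported in the question; it is not scipy's actual implementation.

```python
def positionwise_jaccard(u, v):
    """Compare u and v index by index, ignoring positions where both are 0."""
    # Keep only positions where at least one entry is nonzero
    active = [(a, b) for a, b in zip(u, v) if a != 0 or b != 0]
    if not active:
        return 0.0
    # Fraction of those positions where the two entries differ
    differing = sum(1 for a, b in active if a != b)
    return differing / len(active)


def set_jaccard(u, v):
    """Order-independent Jaccard distance between the sets of elements."""
    a, b = set(u), set(v)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


print(positionwise_jaccard([1, 2, 3], [2, 3, 4]))  # 1.0, every position differs
print(positionwise_jaccard([1, 2, 3], [4, 2, 3]))  # one of three positions differs
print(set_jaccard([1, 2, 3], [2, 3, 4]))           # 0.5
print(set_jaccard([1, 2, 3], [4, 2, 3]))           # 0.5, order does not matter
```

The position-wise version reproduces the 1 and 0.33 values from the question, while the set-based version gives 0.5 for both orderings.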
Upvotes: 0
Reputation: 114791
The docstring for the jaccard function gives a better description of the calculation than the terse summary in the pdist docstring. jaccard computes the Jaccard-Needham dissimilarity for boolean arrays. Its behavior for other array types is not defined, so you shouldn't be passing in arrays of arbitrary integers.
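One way to follow this advice is to encode each row as a boolean indicator vector over the union of all values before calling pdist. The sketch below assumes each row is meant as a set of hashable, sortable items; the helper name to_indicator_matrix is hypothetical and not part of scipy.

```python
import numpy as np
import scipy.spatial.distance as dist


def to_indicator_matrix(rows):
    """Encode each row as a boolean vector over the sorted union of all values."""
    vocabulary = sorted(set().union(*rows))
    column = {value: j for j, value in enumerate(vocabulary)}
    matrix = np.zeros((len(rows), len(vocabulary)), dtype=bool)
    for i, row in enumerate(rows):
        for value in row:
            matrix[i, column[value]] = True
    return matrix


Y = [[1, 2, 3], [2, 3, 4]]
B = to_indicator_matrix(Y)      # rows become [T, T, T, F] and [F, T, T, T]
Q = dist.pdist(B, 'jaccard')    # boolean input, as the jaccard docstring describes
print(Q)                        # [0.5]
```

Because the encoding depends only on which values are present, reordering a row (for example [4, 2, 3] instead of [2, 3, 4]) gives the same result.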
Upvotes: 1