Reputation: 53
Self-learner in Python, I am trying to improve, so any help is very welcome, thanks a lot! I want to compute a Jaccard similarity over a column of my dataframe by matching criteria on another column. df looks like this:
name    bag number  item    quantity
sally   1           BANANA  3
sally   2           BREAD   1
franck  3           BANANA  2
franck  3           ORANGE  1
franck  3           BREAD   4
robert  4           ORANGE  3
jenny   5           BANANA  4
jenny   5           ORANGE  2
There are about 80 categories of items. A bag number (sample) is unique to one shopper, but a shopper can have more than one bag, and quantities range from 0 to 4. I would like to iterate through bag numbers to compare the contents of each pair of bags with a Jaccard similarity or distance, if possible with the option of using the quantity as a comparison weight. The ideal result would be a dataframe like the one in Python Pandas Distance matrix using jaccard similarity.
I feel that the solution is somewhere between this: How to compute jaccard similarity from a pandas dataframe, and that: How to apply a custom function to groups in a dask dataframe, using multiple columns as function input.
I am thinking I should iterate through a mask to set up the two variables of the Jaccard function. But in every example I see, the items to compare are in different columns, so I am kind of lost here... Thanks a lot for helping! Cheers
Upvotes: 3
Views: 2288
Reputation: 3001
Tackling the easier, unweighted version of the problem can be done with the following steps:
create a pivot table with your current dataframe
p = df.pivot_table(
    index='bag_number',   # assumes 'bag number' has been renamed to 'bag_number'
    columns='item',
    values='quantity',
).fillna(0)  # convert NaN (item absent from a bag) to 0
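As a sketch, here is the pivot step run on the sample data from the question (assuming the spaces in the column names are replaced with underscores first), producing one row per bag and one column per item:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['sally', 'sally', 'franck', 'franck', 'franck',
             'robert', 'jenny', 'jenny'],
    'bag_number': [1, 2, 3, 3, 3, 4, 5, 5],
    'item': ['BANANA', 'BREAD', 'BANANA', 'ORANGE', 'BREAD',
             'ORANGE', 'BANANA', 'ORANGE'],
    'quantity': [3, 1, 2, 1, 4, 3, 4, 2],
})

# One row per bag, one column per item; missing items become 0
p = df.pivot_table(
    index='bag_number',
    columns='item',
    values='quantity',
).fillna(0)

# e.g. p.loc[3, 'BREAD'] is 4.0, and p.loc[4, 'BANANA'] is 0.0
```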
follow the example in your linked question to compute the Jaccard distance with scipy
import pandas as pd
from scipy.spatial.distance import jaccard, pdist, squareform

# pdist returns condensed pairwise distances; squareform expands them to a
# square matrix, and 1 - distance converts Jaccard distance to similarity
m = 1 - squareform(pdist(p.astype(bool), jaccard))
sim = pd.DataFrame(m, index=p.index, columns=p.index)
Result:
bag_number 1 2 3 4 5
bag_number
1 1.000000 0.000000 0.333333 0.000000 0.500000
2 0.000000 1.000000 0.333333 0.000000 0.000000
3 0.333333 0.333333 1.000000 0.333333 0.666667
4 0.000000 0.000000 0.333333 1.000000 0.500000
5 0.500000 0.000000 0.666667 0.500000 1.000000
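As a quick sanity check on one entry of that matrix, the unweighted Jaccard similarity between bags 1 and 5 can be computed by hand from the item sets (intersection over union):

```python
# Item sets from the sample data: bag 1 holds only bananas,
# bag 5 holds bananas and oranges
bag1 = {'BANANA'}
bag5 = {'BANANA', 'ORANGE'}

# |intersection| / |union| = 1 / 2
sim_1_5 = len(bag1 & bag5) / len(bag1 | bag5)
print(sim_1_5)  # 0.5, matching the matrix above
```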
The weighted version is only slightly more complicated. The pdist function only supports a single weight vector that it applies to all comparisons, so you'll need to create a custom similarity (or distance) function. According to Wikipedia, the weighted Jaccard similarity is sum(min(x_i, y_i)) / sum(max(x_i, y_i)), so the distance can be computed as follows:
import numpy as np

def weighted_jaccard_distance(x, y):
    # Stack the two quantity vectors, then take the elementwise min and max
    arr = np.array([x, y])
    return 1 - arr.min(axis=0).sum() / arr.max(axis=0).sum()
Now you can compute the weighted similarity:
sim_weighted = pd.DataFrame(
    data=1 - squareform(pdist(p, weighted_jaccard_distance)),
    index=p.index,
    columns=p.index,
)
Result:
bag_number 1 2 3 4 5
bag_number
1 1.00 0.000000 0.250000 0.000000 0.500000
2 0.00 1.000000 0.142857 0.000000 0.000000
3 0.25 0.142857 1.000000 0.111111 0.300000
4 0.00 0.000000 0.111111 1.000000 0.285714
5 0.50 0.000000 0.300000 0.285714 1.000000
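To see how the weighted version arrives at, say, the 0.25 between bags 1 and 3, the quantity vectors (in BANANA, BREAD, ORANGE order) can be fed to the function by hand:

```python
import numpy as np

def weighted_jaccard_distance(x, y):
    arr = np.array([x, y])
    return 1 - arr.min(axis=0).sum() / arr.max(axis=0).sum()

bag1 = [3, 0, 0]  # 3 bananas
bag3 = [2, 4, 1]  # 2 bananas, 4 breads, 1 orange

# elementwise min sums to 2 (the shared bananas), elementwise max sums to 8,
# so the distance is 1 - 2/8 = 0.75 and the similarity is 0.25
d = weighted_jaccard_distance(bag1, bag3)
print(1 - d)  # 0.25, matching the matrix above
```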
Upvotes: 3