Reputation: 2300
I am completely stuck - I would like to count the number of cases an item appears together with another item in a basket, i.e. some correlation:
basket item
0 a x
1 a y
2 a z
3 b x
4 b y
5 c k
6 c y
7 c z
8 d x
9 d y
10 d z
11 d l
That is, items x
, y
, and z
appear together in basket a
, items x
and y
appear together in basket b
, etc.
I am looking for a way to get here:
x y z k l
x 3 2 1
y 3 3 1 1
z 2 3 1 1
k 1 1
l 1 1 1
The matrix is obviously symmetric. The first 3
in the top row means, that x
and y
appear 3 times together in one basket (namely a
, b
and d
), etc.
Thanks a bunch!
Upvotes: 1
Views: 97
Reputation: 862481
Use DataFrame.merge
by same DataFrame with df
, then remove rows with same item
values and last count combinations by crosstab
, last if need order of values like original add DataFrame.reindex
:
df1 = df.merge(df, on='basket')
i = df['item'].unique()
df1 = df1[df1.item_x != df1.item_y]
df2 = pd.crosstab(df1.item_x, df1.item_y).reindex(index=i, columns=i)
print (df2)
item_y x y z k l
item_x
x 0 3 2 0 1
y 3 0 3 1 1
z 2 3 0 1 1
k 0 1 1 0 0
l 1 1 1 0 0
If need missing values use DataFrame.pivot_table
:
df1 = df.merge(df, on='basket')
i = df['item'].unique()
df1 = df1[df1.item_x != df1.item_y]
df2 = df1.pivot_table(index='item_x',
columns='item_y',
aggfunc='size').reindex(index=i, columns=i)
print (df2)
item_y x y z k l
item_x
x NaN 3.0 2.0 NaN 1.0
y 3.0 NaN 3.0 1.0 1.0
z 2.0 3.0 NaN 1.0 1.0
k NaN 1.0 1.0 NaN NaN
l 1.0 1.0 1.0 NaN NaN
Upvotes: 1