Reputation: 6432
I have a data frame like this. Invoices are transformed into a table where each row has a single product. This table consists of millions of rows -
invoice | products |
---|---|
INVC1 | xx |
INVC1 | yy |
INVC1 | zz |
INVC1 | aa |
INVC2 | xx |
INVC2 | aa |
INVC2 | bb |
INVC2 | cc |
Now I want to create a table like the one below where I can see how many times each product is purchased with other products-
xx | yy | zz | aa | bb | cc | |
---|---|---|---|---|---|---|
xx | 0 | 1 | 1 | 2 | 1 | 1 |
yy | 1 | 0 | 1 | 1 | 0 | 0 |
zz | 1 | 1 | 0 | 1 | 0 | 0 |
aa | 2 | 1 | 1 | 0 | 1 | 1 |
bb | 1 | 0 | 0 | 1 | 0 | 1 |
cc | 1 | 0 | 0 | 1 | 1 | 0 |
Is there any numpy/pandas way to create a table like this? Or, is there any optimized way to do this.
Upvotes: 1
Views: 66
Reputation: 323316
Let us use pd.crosstab
then dot
s = pd.crosstab(df['invoice'],df['products'])
out = s.T.dot(s)
Out[196]:
products aa bb cc xx yy zz
products
aa 2 1 1 2 1 1
bb 1 1 1 1 0 0
cc 1 1 1 1 0 0
xx 2 1 1 2 1 1
yy 1 0 0 1 1 1
zz 1 0 0 1 1 1
Upvotes: 6