SarahData

Reputation: 809

Create a basket from a pandas DataFrame - non-standard transaction dataset

I'm working on a dataset using pandas. The dataset has the form:

user_id  product_id
user1    product1
user2    product3
user1    product2

Or, perhaps more clearly:

dataset = [['user1', 'product1'], ['user2', 'product3'], ['user1', 'product2']]

My goal is to use this dataset to recommend products to buy. I plan to use association rules (the Apriori algorithm) for that.

I don't have a typical transaction dataset in which several products share the same transaction ID, and this is the only dataset I can work with. So I thought I would treat each user as one transaction: if user1 bought product1 and product2, then product1 and product2 count as bought together.

Afterwards, I will generate rules with the Apriori algorithm, but to do that I need the data in this form:

data = [['product1', 'product2'], ['product2'], ['product3', 'product1', 'product2']]

So I need my dataset in the following form:

dataset = [['user1', 'product1', 'product2'], ['user2', 'product3']]

After that, I can move on to the remaining steps needed for Apriori: one-hot encoding, finding frequent itemsets, etc.

df.groupby(['user_id'])['product_id']

On its own, groupby doesn't get me there because I still have to apply a function to the groups, and pivot doesn't work either. These are the only two approaches I could think of for this transformation.

Upvotes: 2

Views: 2696

Answers (2)

ALollz

Reputation: 59519

IIUC, you can get what you want with pd.crosstab:

import pandas as pd
df = pd.DataFrame({'user_id': ['user1', 'user2', 'user1', 'user3', 'user3', 'user1', 'user2'],
                   'product_id': ['milk', 'eggs', 'milk', 'bread', 'butter', 'eggs', 'cheese']})

df1 = pd.crosstab(df.user_id, df.product_id).astype('bool').astype('int')
df1.columns.name=None
df1.index.name=None

df1 is now:

       bread  butter  cheese  eggs  milk
user1      0       0       0     1     1
user2      0       0       1     1     0
user3      1       1       0     0     0
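
A small sketch of why the .astype('bool') step is there, using the same sample data: crosstab counts rows, so a product that a user bought twice shows up as 2 rather than 1. The names counts and onehot are illustrative only. Whether your Apriori implementation wants booleans or 0/1 integers is something to check against its documentation.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['user1', 'user2', 'user1', 'user3', 'user3', 'user1', 'user2'],
    'product_id': ['milk', 'eggs', 'milk', 'bread', 'butter', 'eggs', 'cheese'],
})

# Raw crosstab: each cell is the number of rows for that (user, product) pair
counts = pd.crosstab(df['user_id'], df['product_id'])

# Collapse counts to presence/absence; any count > 0 becomes True
onehot = counts.astype(bool)

print(counts.loc['user1', 'milk'])   # user1 has two milk rows
print(onehot.loc['user1', 'milk'])   # presence flag only
```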

If you need that list format, you can groupby + apply(list).

df.groupby('user_id').product_id.apply(list)
#user_id
#user1    [milk, milk, eggs]
#user2        [eggs, cheese]
#user3       [bread, butter]
#Name: product_id, dtype: object

Or if you don't care about duplicates:

df.groupby('user_id').product_id.apply(set)
#user_id
#user1       {milk, eggs}
#user2     {cheese, eggs}
#user3    {bread, butter}
#Name: product_id, dtype: object
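
If a plain list of lists is needed (the "data" form in the question), one possible sketch is to drop repeated (user, product) rows first and then call .tolist() on the grouped result. This keeps each basket as a list in order of first appearance, which a set does not guarantee. The names baskets and transactions are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['user1', 'user2', 'user1', 'user3', 'user3', 'user1', 'user2'],
    'product_id': ['milk', 'eggs', 'milk', 'bread', 'butter', 'eggs', 'cheese'],
})

# Drop repeated (user, product) rows, then collect one list of products per user
baskets = (
    df.drop_duplicates()
      .groupby('user_id')['product_id']
      .apply(list)
)

# Plain Python list of lists, one inner list per user
transactions = baskets.tolist()
print(transactions)
```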

Upvotes: 7

gyx-hh

Reputation: 1431

This might not be the best solution; maybe someone more experienced can provide a proper pandas one. I managed to get the output you need by doing the following:

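# Set user_id as the index of the DataFrame (this modifies df in place)
df.set_index('user_id', inplace=True)

dataset = []
for u in df.index.unique():
    # .loc returns a plain string when the user has one row, otherwise a Series
    data = df.loc[u, 'product_id']
    data = [data] if isinstance(data, str) else data.tolist()
    dataset.append([u] + data)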

Output:

[['user1', 'product1', 'product2'], ['user2', 'product3']]
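
For comparison, a sketch that should give the same output by iterating over groupby directly. It leaves df unchanged and does not need the isinstance check, because each group is always a DataFrame even when it has a single row. The sample frame below is rebuilt from the question's example.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['user1', 'user2', 'user1'],
    'product_id': ['product1', 'product3', 'product2'],
})

# Each iteration yields the user id and the sub-DataFrame of that user's rows
dataset = [
    [user, *group['product_id'].tolist()]
    for user, group in df.groupby('user_id')
]
print(dataset)
```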

Let me know if this answers your question :)

Upvotes: 2
