Reputation: 809
I'm working on a dataset using pandas. The dataset is in the form:
user_id product_id
user1 product1
user2 product3
user1 product2
or maybe this is more clear:
dataset=[[user1,product1], [user2,product3], [user1,product2]]
My goal is to use this dataset to recommend products to buy, using association rules (the apriori algorithm).
I don't have a typical transaction dataset in which several products share one transaction ID, and this is the only dataset I can work with. So I thought of treating all products bought by the same user as bought together: if user1 bought product1 and product2, then product1 and product2 count as one transaction.
Afterwards, I will create rules using association rules/the apriori algorithm, but to do that I need the data to be in the form:
data=[[product1,product2], [product2], [product3, product1, product2]]
So I need my dataset in the following form:
dataset=[[user1,product1,product2],[user2,product3]]
Afterwards, I can go on to the further steps of applying apriori: one-hot encoding, discovering frequent itemsets, etc.
df.groupby(['user_id'])['product_id']
I tried groupby (as above), but it doesn't give a result on its own because I still have to apply a function to the groups. The pivot function doesn't work either. These are the only two approaches I could think of for the transformation.
Upvotes: 2
Views: 2696
Reputation: 59519
IIUC, you can get what you want with pd.crosstab:
import pandas as pd
df = pd.DataFrame({'user_id': ['user1', 'user2', 'user1', 'user3', 'user3', 'user1', 'user2'],
'product_id': ['milk', 'eggs', 'milk', 'bread', 'butter', 'eggs', 'cheese']})
df1 = pd.crosstab(df.user_id, df.product_id).astype('bool').astype('int')
df1.columns.name=None
df1.index.name=None
df1 is now:
bread butter cheese eggs milk
user1 0 0 0 1 1
user2 0 0 1 1 0
user3 1 1 0 0 0
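Since the question mentions discovering frequent items as the next step, here is a minimal sketch of how the one-hot table relates to support, the quantity apriori thresholds on. It is not an apriori implementation; the names onehot, item_support and pair_support are made up for illustration, and each user is assumed to count as one transaction.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['user1', 'user2', 'user1', 'user3', 'user3', 'user1', 'user2'],
    'product_id': ['milk', 'eggs', 'milk', 'bread', 'butter', 'eggs', 'cheese'],
})

# Boolean one-hot table: one row per user, one column per product.
onehot = pd.crosstab(df['user_id'], df['product_id']).astype(bool)

# Support of a single item = fraction of users (rows) that contain it.
item_support = onehot.mean()

# Support of an itemset = fraction of users that contain every item in it.
pair_support = (onehot['eggs'] & onehot['milk']).mean()

print(item_support)
print(pair_support)
```

With this sample data, eggs appears for two of the three users and the pair (eggs, milk) for one of them.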
If you need the list format, you can use groupby + apply(list):
df.groupby('user_id').product_id.apply(list)
#user_id
#user1 [milk, milk, eggs]
#user2 [eggs, cheese]
#user3 [bread, butter]
#Name: product_id, dtype: object
Or, if you don't care about duplicates (note that sets are unordered):
df.groupby('user_id').product_id.apply(set)
#user_id
#user1 {milk, eggs}
#user2 {cheese, eggs}
#user3 {bread, butter}
#Name: product_id, dtype: object
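To get from that grouped Series to the exact layouts in the question, something like the following should work. This is a sketch on the question's three sample rows; the variable names baskets, transactions and dataset are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['user1', 'user2', 'user1'],
    'product_id': ['product1', 'product3', 'product2'],
})

# One list of products per user; sort=False keeps users in order of first appearance.
baskets = df.groupby('user_id', sort=False)['product_id'].apply(list)

# Plain list of transactions (products only).
transactions = baskets.tolist()

# The user-prefixed layout asked for in the question.
dataset = [[user, *products] for user, products in baskets.items()]

print(transactions)
print(dataset)
```

Swapping list for set in the apply call, and then converting each set back to a list, would drop repeated purchases if those should not count twice.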
Upvotes: 7
Reputation: 1431
This might not be the best solution; maybe someone more experienced can provide a proper pandas solution. I managed to achieve the output you require by doing the following:
# set user_id as index of dataframe
df.set_index('user_id', inplace=True)
dataset=[]
for u in df.index.unique():
    # .loc returns a plain string when the user has one row, a Series otherwise
    data = df.loc[u, 'product_id']
    data = [data] if isinstance(data, str) else data.tolist()
    dataset.append([u] + data)
Output:
[['user1', 'product1', 'product2'], ['user2', 'product3']]
Let me know if this answers your question :)
Upvotes: 2