Reputation: 375
I have a dataset of sales and purchases on a market place, looking a little like this.
User_ID | Transaction_Type | Date | Amount
1 | Sale | 01/01/14 | 200.00
2 | Purchase | 01/01/14 | 30.00
...
I need to filter out customers who have only bought or sold something versus customers who have bought and sold something at least once.
I am trying to create a function that will check if a user has done both or not. If a user has done both, then the user will be marked as a yes, otherwise no.
So far I have tried doing this,
def user_filter(df):
if df in df['User_ID'].filter(lambda x : ((x['Transaction_Type']=='Sale').any())&((x['Transaction_Type']=='Purchase').any())):
return 'yes'
else:
return 'no'
df['cross'] = df['User_ID'].apply(user_filter)
Let's assume later on in the dataset that User_ID 1 will come back as a Purchase. I would hope it would return as :
User_ID | Transaction_Type | Date | Amount | cross
1 | Sale | 01/01/14 | 200.00 | yes
2 | Purchase | 01/01/14 | 30.00 | no
but the following error returns:
'int' object is not subscriptable
When i apply it to the whole dataframe as opposed to just the series, it returns:
KeyError: ('User_ID', 'occurred at index User_ID')
Upvotes: 0
Views: 103
Reputation: 12406
One possible approach to doing this is using groupby
and then, instead of aggregating, simply list the Transaction_Type
s in each group like show in this SO post. Then, just get the length of the list....if the length is 2 this means both Sale
and Purchase
are present for that user. On the other hand, if the length is 1, then only one of Sale
or Purchase
is present for that user.
Generate some data per the OP (I added a third record to make the output more explicit)
d = [['User_ID', 'Transaction_Type', 'Date', 'Amount'],
[1, 'Sale', '01/01/14', 200],
[1, 'Purchase','01/02/14',300],
[2, 'Purchase','01/01/14',30],]
Perform the GROUP BY
df_users = df.groupby('User_ID')['Transaction_Type'].apply(list).reset_index(drop=False)
df_users.rename(columns={'Transaction_Type':'Transactions'}, inplace=True)
print(df_users)
User_ID Transactions
0 1 [Sale, Purchase]
1 2 [Purchase]
Now append a cross
column to the grouped DataFrame
and populate the cross
column as required
df_users['cross'] = 'no'
df_users.loc[df_users.Transactions.apply(len)==2, 'cross'] = 'yes'
print(df_users)
User_ID Transactions cross
0 1 [Sale, Purchase] yes
1 2 [Purchase] no
EDIT 1
Alternatively, drop the apply
steps and just use size
df_users = df.groupby('User_ID')['Transaction_Type'].size().reset_index(drop=False)
df_users['cross'] = 'no'
df_users.loc[df_users.Transactions==2, 'cross'] = 'yes'
print(df_users)
User_ID Transactions cross
0 1 2 yes
1 2 1 no
EDIT 2
If you wanted to append the cross
column to the source DataFrame
, then add these 2 lines of code to the above
df = df.merge(df_users, on='User_ID')
df.drop(columns=['Transactions'], inplace=True)
print(df)
User_ID Transaction_Type Date Amount cross
0 1 Sale 01/01/14 200 yes
1 1 Purchase 01/02/14 300 yes
2 2 Purchase 01/01/14 30 no
Upvotes: 1