Kbbm
Kbbm

Reputation: 375

Applying filter functions on a dataframe

I have a dataset of sales and purchases on a market place, looking a little like this.

User_ID | Transaction_Type |   Date   | Amount
    1   |      Sale        | 01/01/14 | 200.00
    2   |     Purchase     | 01/01/14 |  30.00
    ...

I need to filter out customers who have only bought or sold something versus customers who have bought and sold something at least once.

I am trying to create a function that will check if a user has done both or not. If a user has done both, then the user will be marked as a yes, otherwise no.

So far I have tried doing this,

def user_filter(df):
if df in df['User_ID'].filter(lambda x : ((x['Transaction_Type']=='Sale').any())&((x['Transaction_Type']=='Purchase').any())):
    return 'yes'
else:
    return 'no'
df['cross'] = df['User_ID'].apply(user_filter)

Let's assume later on in the dataset that User_ID 1 will come back as a Purchase. I would hope it would return as :

User_ID | Transaction_Type |   Date   | Amount | cross
    1   |      Sale        | 01/01/14 | 200.00 |  yes
    2   |     Purchase     | 01/01/14 |  30.00 |   no

but the following error returns:

'int' object is not subscriptable

When i apply it to the whole dataframe as opposed to just the series, it returns:

KeyError: ('User_ID', 'occurred at index User_ID')

Upvotes: 0

Views: 103

Answers (1)

edesz
edesz

Reputation: 12406

One possible approach to doing this is using groupby and then, instead of aggregating, simply list the Transaction_Types in each group like show in this SO post. Then, just get the length of the list....if the length is 2 this means both Sale and Purchase are present for that user. On the other hand, if the length is 1, then only one of Sale or Purchase is present for that user.

Generate some data per the OP (I added a third record to make the output more explicit)

d = [['User_ID', 'Transaction_Type', 'Date', 'Amount'],
    [1, 'Sale', '01/01/14', 200],
    [1, 'Purchase','01/02/14',300],
    [2, 'Purchase','01/01/14',30],]

Perform the GROUP BY

df_users = df.groupby('User_ID')['Transaction_Type'].apply(list).reset_index(drop=False)
df_users.rename(columns={'Transaction_Type':'Transactions'}, inplace=True)

print(df_users)
   User_ID      Transactions
0        1  [Sale, Purchase]
1        2        [Purchase]

Now append a cross column to the grouped DataFrame and populate the cross column as required

df_users['cross'] = 'no'
df_users.loc[df_users.Transactions.apply(len)==2, 'cross'] = 'yes'

print(df_users)
   User_ID      Transactions cross
0        1  [Sale, Purchase]   yes
1        2        [Purchase]    no

EDIT 1

Alternatively, drop the apply steps and just use size

df_users = df.groupby('User_ID')['Transaction_Type'].size().reset_index(drop=False)
df_users['cross'] = 'no'
df_users.loc[df_users.Transactions==2, 'cross'] = 'yes'

print(df_users)
   User_ID  Transactions cross
0        1             2   yes
1        2             1    no

EDIT 2

If you wanted to append the cross column to the source DataFrame, then add these 2 lines of code to the above

df = df.merge(df_users, on='User_ID')
df.drop(columns=['Transactions'], inplace=True)

print(df)
   User_ID Transaction_Type      Date  Amount cross
0        1             Sale  01/01/14     200   yes
1        1         Purchase  01/02/14     300   yes
2        2         Purchase  01/01/14      30    no

Upvotes: 1

Related Questions