Reputation: 402
I have a dataframe with every transaction from multiple stores
df_transaction:
client_id  purchase_id  store  value
1          100          A      100
1          101          B      500
2          102          A      200
2          103          C      300
3          104          B      400
3          105          C      100
4          106          A      500
4          107          B      200
4          108          C      300
4          109          A      100
5          110          B      400
5          111          C      100
6          112          D      500
I want to generate all pairs of stores that were visited by the same client
df_pair:
store_1 store_2
A B
A C
B C
And then, for each pair, calculate: how many distinct clients went to both stores? How much did this group spend in store 1 and in store 2? How many transactions did this group make in store 1 and in store 2?
The expected output would be:
store_1  store_2  count_distinct_both     sum_store_1      count_store_1                sum_store_2      count_store_2
A        B        2 (client_id 1 and 4)   100 + 500 + 100  3 (client 1, 4 and 4 again)  500 + 200        2 (client 1, 4)
A        C        2 (client_id 2 and 4)   400 + 200        2 (client 2, 4)              300 + 300        2 (client 2, 4)
B        C        3 (client 3, 4 and 5)   400 + 200 + 400  3 (client 3, 4 and 5)        100 + 300 + 100  3 (client 3, 4 and 5)
Is there a way for me to do this? Maybe with a self join joining by client_id I can create the dataframe with the pairs of stores, but I'm not sure how to do that or how to get the distinct count, the sums and the counts.
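The closest I have got for the pair dataframe is something along these lines (a rough sketch, assuming the dataframe above is df_transaction; I'm not sure it's the right approach):
# Rough sketch: self join on client_id, keep each ordered store pair once,
# then deduplicate to get the pairs (assumes df_transaction is the table above)
df_pair = (df_transaction.merge(df_transaction, on='client_id', suffixes=('_1', '_2'))
           .query('store_1 < store_2')[['store_1', 'store_2']]
           .drop_duplicates()
           .reset_index(drop=True))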
EDIT: Quang Hoang's query was working great, but I then found an issue with the sum columns. If a client goes to store A twice and store B twice, the self merge generates a cartesian product and the sum values are duplicated. For example, with these transactions:
client_id  purchase_id  store  value
1          100          A      10
1          101          A      15
1          102          B      20
1          103          B      25
The expected sum for store A would be 25 and for store B would be 45, but what we're getting now is 50 and 90.
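A minimal sketch that reproduces the doubling, building the four rows above by hand (df_small is just a name for this sketch) and applying the same self merge:
import pandas as pd

# The four transactions above: one client, two purchases in A, two in B
df_small = pd.DataFrame({
    'client_id':   [1, 1, 1, 1],
    'purchase_id': [100, 101, 102, 103],
    'store':       ['A', 'A', 'B', 'B'],
    'value':       [10, 15, 20, 25],
})

# The self merge pairs every A purchase with every B purchase (2 x 2 rows),
# so each value is counted twice and the sums come out as 50 and 90
(df_small.merge(df_small, on='client_id')
         .query('store_x < store_y')
         .groupby(['store_x', 'store_y'], as_index=False)
         .agg(sum_store_1=('value_x', 'sum'), sum_store_2=('value_y', 'sum')))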
Is there any way to avoid that?
Upvotes: 1
Views: 41
Reputation: 402
Quang Hoang's answer worked great for the counts, but for the sums it would duplicate values if a client went to store A more than once and also went to store B more than once.
I had this in my transaction dataframe:
client_id  purchase_id  store  value
1          100          A      10
1          101          A      15
1          102          B      20
1          103          B      25
And the self join would generate this:
client_id  store_1  purchase_id_1  value_1  store_2  purchase_id_2  value_2
1          A        100            10       B        102            20
1          A        100            10       B        103            25
1          A        101            15       B        102            20
1          A        101            15       B        103            25
The expected sum for store A is 25 and for store B is 45, but we would get 50 and 90.
So now I'm calculating the sum for each store separately, dropping the duplicated rows before summing, and then merging everything together:
# Self join on client_id and keep each store pair once (store_1 < store_2)
d = data.merge(right=data, left_on='client_id', right_on='client_id',
               suffixes=('_1', '_2')).query('store_1 < store_2')

# Distinct clients and distinct purchases per store pair
g = d.groupby(['store_1', 'store_2'], as_index=False).agg(
    clients_both_Stores=('client_id', 'nunique'),
    count_store_1=('purchase_id_1', 'nunique'),
    count_store_2=('purchase_id_2', 'nunique'))

# Sum for store 1: drop the rows duplicated by the cartesian product,
# keeping one row per original purchase within each pair, then sum
gv1 = d.drop_duplicates(subset=['client_id', 'store_1', 'store_2', 'purchase_id_1', 'value_1'])\
    .groupby(['store_1', 'store_2'], as_index=False)\
    .agg(value_store_1=('value_1', 'sum'))

# Same for store 2
gv2 = d.drop_duplicates(subset=['client_id', 'store_1', 'store_2', 'purchase_id_2', 'value_2'])\
    .groupby(['store_1', 'store_2'], as_index=False)\
    .agg(value_store_2=('value_2', 'sum'))

# Merge the deduplicated sums back onto the counts
g = g.merge(gv1, on=['store_1', 'store_2'])
g = g.merge(gv2, on=['store_1', 'store_2'])
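With the four-row example above, this gives value_store_1 = 25 and value_store_2 = 45 for the (A, B) pair instead of the doubled 50 and 90, since the drop_duplicates step keeps exactly one row per original purchase within each store pair before summing.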
Upvotes: 0
Reputation: 150735
Try self merge then groupby:
(df.merge(df, on='client_id')                      # self merge: pair each client's purchases
   .query('store_x < store_y')                     # keep each store pair once, drop same-store rows
   .groupby(['store_x', 'store_y'], as_index=False)
   .agg({'value_x': 'sum', 'value_y': 'sum',
         'client_id': 'nunique'})
)
Output:
store_x store_y value_x value_y client_id
0 A B 700 900 2
1 A C 800 900 2
2 B C 1000 500 3
Update: a named-aggregation option that matches the expected output:
(df.merge(df, on='client_id')
.query('store_x < store_y')
.groupby(['store_x', 'store_y'], as_index=False)
.agg(sum_store_1=('value_x','sum'),
count_store_1=('purchase_id_x','nunique'),
sum_store_2=('value_y','sum'),
count_store_2=('purchase_id_y','nunique'),
distinct_both=('client_id','nunique')
)
)
Output:
store_x store_y sum_store_1 count_store_1 sum_store_2 count_store_2 distinct_both
0 A B 700 3 900 2 2
1 A C 800 3 900 2 2
2 B C 1000 3 500 3 3
Upvotes: 3