eduardoftdo

Reputation: 402

Generate pairs of stores visited by the same client from a transaction dataframe and calculate aggregated metrics

I have a dataframe with every transaction from multiple stores

df_transaction:

client_id   purchase_id   store   value
1           100           A       100
1           101           B       500

2           102           A       200
2           103           C       300

3           104           B       400
3           105           C       100

4           106           A       500
4           107           B       200
4           108           C       300
4           109           A       100

5           110           B       400
5           111           C       100

6           112           D       500

I want to generate all pairs of stores that were visited by the same client

df_pair

store_1   store_2
A         B
A         C
B         C

And then, for each pair, calculate: How many distinct clients visited both stores? What is the total amount this group spent in store 1 and in store 2? How many transactions did this group make in each of the two stores?

The expected output would be:

store_1  store_2  count_distinct_both     sum_store_1             count_store_1                 sum_store_2            count_store_2
A        B        2 (clients 1 and 4)     100 + 500 + 100 = 700   3 (client 1, client 4 twice)  500 + 200 = 700        2 (clients 1 and 4)
A        C        2 (clients 2 and 4)     200 + 500 + 100 = 800   3 (client 2, client 4 twice)  300 + 300 = 600        2 (clients 2 and 4)
B        C        3 (clients 3, 4 and 5)  400 + 200 + 400 = 1000  3 (clients 3, 4 and 5)        100 + 300 + 100 = 500  3 (clients 3, 4 and 5)

Is there a way to do this? Maybe with a self join on client_id I can create the dataframe with the pairs of stores, but I'm not sure how to do that, or how to get the distinct count, the sums and the counts.
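For reference, a minimal sketch of that self-join idea for building just the pair list, assuming pandas and the sample data above (names are illustrative):

import pandas as pd

# Sample data from the question
df_transaction = pd.DataFrame({
    'client_id':   [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6],
    'purchase_id': list(range(100, 113)),
    'store':       ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'D'],
    'value':       [100, 500, 200, 300, 400, 100, 500, 200, 300, 100, 400, 100, 500],
})

# A self join on client_id pairs every purchase with every other purchase
# by the same client; keeping store_1 < store_2 drops same-store rows and
# mirrored duplicates (B/A vs A/B).
df_pair = (df_transaction.merge(df_transaction, on='client_id', suffixes=('_1', '_2'))
           .query('store_1 < store_2')[['store_1', 'store_2']]
           .drop_duplicates()
           .reset_index(drop=True))
print(df_pair)   # rows: A B, A C, B C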

EDIT: Quang Hoang's query was working great, however I've now found an issue with the sum columns. If a client goes to store A twice and store B twice, the self join generates a cartesian product and the sum values are duplicated. In this example:

client_id   purchase_id   store   value
1           100           A       10
1           101           A       15
1           102           B       20
1           103           B       25

The expected sum for store A would be 25 and for store B would be 45, but what we're getting now is 50 and 90.

Is there any way to avoid that?

Upvotes: 1

Views: 41

Answers (2)

eduardoftdo

Reputation: 402

Quang Hoang's answer worked great for the counts, but the sums would duplicate values if a client went to store A more than once and also went to store B more than once.

I had this in my transaction dataframe

client_id   purchase_id   store   value
1           100           A       10
1           101           A       15
1           102           B       20
1           103           B       25

And the self join would generate this:

client_id  store_1  purchase_id_1  value_1  store_2  purchase_id_2  value_2
1          A        100            10       B        102            20
1          A        100            10       B        103            25
1          A        101            15       B        102            20
1          A        101            15       B        103            25

The expected sums are 25 for store A and 45 for store B, but we would get 50 and 90.

So now I'm calculating each sum separately, dropping the duplicate rows before summing, and then merging everything back together:

d = data.merge(data, on='client_id', suffixes=('_1', '_2')).query('store_1 < store_2')

# The counts are safe to compute on the joined frame: nunique is not
# affected by the duplication from the cartesian product.
g = d.groupby(['store_1', 'store_2'], as_index=False).agg(
        clients_both_stores=('client_id', 'nunique'),
        count_store_1=('purchase_id_1', 'nunique'),
        count_store_2=('purchase_id_2', 'nunique'))

# For the sums, drop the duplicated purchase rows on each side first.
gv1 = d.drop_duplicates(subset=['client_id', 'store_1', 'store_2', 'purchase_id_1', 'value_1'])\
       .groupby(['store_1', 'store_2'], as_index=False)\
       .agg(value_store_1=('value_1', 'sum'))

gv2 = d.drop_duplicates(subset=['client_id', 'store_1', 'store_2', 'purchase_id_2', 'value_2'])\
       .groupby(['store_1', 'store_2'], as_index=False)\
       .agg(value_store_2=('value_2', 'sum'))

g = g.merge(gv1, on=['store_1', 'store_2'])
g = g.merge(gv2, on=['store_1', 'store_2'])
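As a quick check, running the snippet above on the four-row example (reusing the frame name data from the snippet) gives 25 and 45 for the two sums:

import pandas as pd

data = pd.DataFrame({
    'client_id':   [1, 1, 1, 1],
    'purchase_id': [100, 101, 102, 103],
    'store':       ['A', 'A', 'B', 'B'],
    'value':       [10, 15, 20, 25],
})

# After running the snippet above, g is:
#   store_1 store_2  clients_both_stores  count_store_1  count_store_2  value_store_1  value_store_2
# 0       A       B                    1              2              2             25             45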

Upvotes: 0

Quang Hoang

Reputation: 150735

Try a self merge, then groupby:

(df.merge(df, on='client_id')
   .query('store_x < store_y')
   .groupby(['store_x', 'store_y'], as_index=False)
   .agg({'value_x':'sum','value_y':'sum', 
         'client_id':'nunique'})
)

Output:

  store_x store_y  value_x  value_y  client_id
0       A       B      700      900          2
1       A       C      800      900          2
2       B       C     1000      500          3

Update: a named-aggregation option that matches the expected output:

(df.merge(df, on='client_id')
   .query('store_x < store_y')
   .groupby(['store_x', 'store_y'], as_index=False)
   .agg(sum_store_1=('value_x','sum'),
        count_store_1=('purchase_id_x','nunique'),
        sum_store_2=('value_y','sum'), 
        count_store_2=('purchase_id_y','nunique'),
        distinct_both=('client_id','nunique')
       )
)

Output:

  store_x store_y  sum_store_1  count_store_1  sum_store_2  count_store_2  distinct_both
0       A       B          700              3          900              2              2
1       A       C          800              3          900              2              2
2       B       C         1000              3          500              3              3
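As the question's EDIT notes, the plain sums above double-count when a client visits both stores more than once, because the self merge forms a cartesian product per client (the nunique counts are unaffected). One sketch of a workaround, under the same column names, is to pre-aggregate per client and store before the merge, so each client contributes exactly one row per store pair:

# Each client's spend and transaction count per store, computed first
per = (df.groupby(['client_id', 'store'], as_index=False)
         .agg(value=('value', 'sum'), n_tx=('purchase_id', 'nunique')))

(per.merge(per, on='client_id', suffixes=('_1', '_2'))
    .query('store_1 < store_2')
    .groupby(['store_1', 'store_2'], as_index=False)
    .agg(distinct_both=('client_id', 'nunique'),
         sum_store_1=('value_1', 'sum'),
         count_store_1=('n_tx_1', 'sum'),
         sum_store_2=('value_2', 'sum'),
         count_store_2=('n_tx_2', 'sum'))
)

On the question's sample data this reproduces the output above, and on the EDIT's example it gives 25 and 45 instead of 50 and 90.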

Upvotes: 3
