eduardoftdo

Reputation: 402

Generate pairs of stores visited by the same client from a transaction dataframe and calculate aggregated metrics

I have a dataframe with every transaction from multiple stores

df_transaction:

client_id   purchase_id   store   value
1           100           A       100
1           101           B       500

2           102           A       200
2           103           C       300

3           104           B       400
3           105           C       100

4           106           A       500
4           107           B       200
4           108           C       300
4           109           A       100

5           110           B       400
5           111           C       100

6           112           D       500

I want to generate all pairs of stores that were visited by the same client

df_pair

store_1   store_2
A         B
A         C
B         C

And then, for each pair, calculate: How many distinct clients visited both stores? What is the total amount this group spent in store 1 and in store 2? How many transactions did this group make in each of the two stores?

The expected output would be:

store_1  store_2  count_distinct_both     sum_store_1             count_store_1                 sum_store_2            count_store_2
A        B        2 (clients 1 and 4)     100 + 500 + 100 = 700   3 (client 1, client 4 twice)  500 + 200 = 700        2 (clients 1 and 4)
A        C        2 (clients 2 and 4)     200 + 500 + 100 = 800   3 (client 2, client 4 twice)  300 + 300 = 600        2 (clients 2 and 4)
B        C        3 (clients 3, 4 and 5)  400 + 200 + 400 = 1000  3 (clients 3, 4 and 5)        100 + 300 + 100 = 500  3 (clients 3, 4 and 5)

Is there a way to do this? Maybe with a self join on client_id I can create the dataframe with the pairs of stores, but I'm not sure how to do that, or how to get the distinct count, the sums and the counts.
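For reference, a minimal sketch of that self-join idea for building just the pair list, assuming pandas and the sample data above (names are illustrative):

import pandas as pd

# Sample data from the question
df_transaction = pd.DataFrame({
    'client_id':   [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6],
    'purchase_id': list(range(100, 113)),
    'store':       ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'D'],
    'value':       [100, 500, 200, 300, 400, 100, 500, 200, 300, 100, 400, 100, 500],
})

# A self join on client_id pairs every purchase with every other purchase
# by the same client; keeping store_1 < store_2 drops same-store rows and
# mirrored duplicates (B/A vs A/B).
df_pair = (df_transaction.merge(df_transaction, on='client_id', suffixes=('_1', '_2'))
           .query('store_1 < store_2')[['store_1', 'store_2']]
           .drop_duplicates()
           .reset_index(drop=True))
print(df_pair)   # rows: A B, A C, B C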

EDIT: Quang Hoang's query was working great, however I've now found an issue with the sum columns. If a client goes to store A twice and store B twice, the self join generates a cartesian product and the sum values are duplicated. In this example:

client_id   purchase_id   store   value
1           100           A       10
1           101           A       15
1           102           B       20
1           103           B       25

The expected sum for store A would be 25 and for store B would be 45, but what we're getting now is 50 and 90.

Is there any way to avoid that?

Upvotes: 1

Views: 41

Answers (2)

eduardoftdo

Reputation: 402

Quang Hoang's answer worked great for the counts, but the sums would duplicate values if a client went to store A more than once and also went to store B more than once.

I had this in my transaction dataframe

client_id   purchase_id   store   value
1           100           A       10
1           101           A       15
1           102           B       20
1           103           B       25

And the self join would generate this:

client_id  store_1  purchase_id_1  value_1  store_2  purchase_id_2  value_2
1          A        100            10       B        102            20
1          A        100            10       B        103            25
1          A        101            15       B        102            20
1          A        101            15       B        103            25

The expected sums are 25 for store A and 45 for store B, but we would get 50 and 90.

So now I'm calculating each sum separately, dropping the duplicate rows before summing, and then merging everything back together:

d = data.merge(data, on='client_id', suffixes=('_1', '_2')).query('store_1 < store_2')

# The counts are safe to compute on the joined frame: nunique is not
# affected by the duplication from the cartesian product.
g = d.groupby(['store_1', 'store_2'], as_index=False).agg(
        clients_both_stores=('client_id', 'nunique'),
        count_store_1=('purchase_id_1', 'nunique'),
        count_store_2=('purchase_id_2', 'nunique'))

# For the sums, drop the duplicated purchase rows on each side first.
gv1 = d.drop_duplicates(subset=['client_id', 'store_1', 'store_2', 'purchase_id_1', 'value_1'])\
       .groupby(['store_1', 'store_2'], as_index=False)\
       .agg(value_store_1=('value_1', 'sum'))

gv2 = d.drop_duplicates(subset=['client_id', 'store_1', 'store_2', 'purchase_id_2', 'value_2'])\
       .groupby(['store_1', 'store_2'], as_index=False)\
       .agg(value_store_2=('value_2', 'sum'))

g = g.merge(gv1, on=['store_1', 'store_2'])
g = g.merge(gv2, on=['store_1', 'store_2'])
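As a quick check, running the snippet above on the four-row example (reusing the frame name data from the snippet) gives 25 and 45 for the two sums:

import pandas as pd

data = pd.DataFrame({
    'client_id':   [1, 1, 1, 1],
    'purchase_id': [100, 101, 102, 103],
    'store':       ['A', 'A', 'B', 'B'],
    'value':       [10, 15, 20, 25],
})

# After running the snippet above, g is:
#   store_1 store_2  clients_both_stores  count_store_1  count_store_2  value_store_1  value_store_2
# 0       A       B                    1              2              2             25             45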

Upvotes: 0

Quang Hoang

Reputation: 150735

Try a self merge, then groupby:

(df.merge(df, on='client_id')
   .query('store_x < store_y')
   .groupby(['store_x', 'store_y'], as_index=False)
   .agg({'value_x':'sum','value_y':'sum', 
         'client_id':'nunique'})
)

Output:

  store_x store_y  value_x  value_y  client_id
0       A       B      700      900          2
1       A       C      800      900          2
2       B       C     1000      500          3

Update: a named-aggregation option that matches the expected output:

(df.merge(df, on='client_id')
   .query('store_x < store_y')
   .groupby(['store_x', 'store_y'], as_index=False)
   .agg(sum_store_1=('value_x','sum'),
        count_store_1=('purchase_id_x','nunique'),
        sum_store_2=('value_y','sum'), 
        count_store_2=('purchase_id_y','nunique'),
        distinct_both=('client_id','nunique')
       )
)

Output:

  store_x store_y  sum_store_1  count_store_1  sum_store_2  count_store_2  distinct_both
0       A       B          700              3          900              2              2
1       A       C          800              3          900              2              2
2       B       C         1000              3          500              3              3
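As the question's EDIT notes, the plain sums above double-count when a client visits both stores more than once, because the self merge forms a cartesian product per client (the nunique counts are unaffected). One sketch of a workaround, under the same column names, is to pre-aggregate per client and store before the merge, so each client contributes exactly one row per store pair:

# Each client's spend and transaction count per store, computed first
per = (df.groupby(['client_id', 'store'], as_index=False)
         .agg(value=('value', 'sum'), n_tx=('purchase_id', 'nunique')))

(per.merge(per, on='client_id', suffixes=('_1', '_2'))
    .query('store_1 < store_2')
    .groupby(['store_1', 'store_2'], as_index=False)
    .agg(distinct_both=('client_id', 'nunique'),
         sum_store_1=('value_1', 'sum'),
         count_store_1=('n_tx_1', 'sum'),
         sum_store_2=('value_2', 'sum'),
         count_store_2=('n_tx_2', 'sum'))
)

On the question's sample data this reproduces the output above, and on the EDIT's example it gives 25 and 45 instead of 50 and 90.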

Upvotes: 3
