Salih K
Salih K

Reputation: 61

How to create a key:value pair or 2 sets of lists returning item ids and corresponding number of purchases in sql (or snowflake to be more specific)?

I have an enormous clickstream data which I can pull the timestamp, user_ids corresponding item_ids that customers clicked.

I am trying to create a dataset that will have the unique customer ids (user_id) on each row and another column that has key:value pair that provides me the list of items and number of times corresponding item is clicked like a python dictionary (or this can be in two columns with items and corresponding num. clicks as a sorted list from most frequent to least). The number of clicks for an item is the times that item_id occur for a fixed user_id. Basically, every occurrence of item is one click. If it occurs 10 times for a fixed user_id, that means that click_count = 10 for that user_id, item_id pair.

Here is an example:

USER_ID CLICKED_ITEMS
user1 item3: 20, item2: 10, item1: 5
user2 item1: 10, item2: 1, item3: 1
user3 item2: 1
user4 item1: 10, item2: 2

or

USER_ID CLICKED_ITEMS NUM_CLICKS
user1 item3, item2, item1 20, 10, 5
user2 item1, item2, item3 10, 1, 1
user3 item2 1
user4 item1, item2 0, 2

Is this possible using SQL (esp. for Snowflake syntax)? TIA

I tried

SELECT user_id, using ARRAY_UNIQUE_AGG(item_id) AS clicked_items
FROM clickstream_table
GROUP_BY user_id

and got unique user_ids and corresponding list of items but not sure how to find number of times each item is clicked. I can do this easily using pandas but unfortunately I cannot even pull 1 month of data due to size and timeout limitations (3hr limit) on warehouse.

Upvotes: 0

Views: 746

Answers (1)

Simeon Pilgrim
Simeon Pilgrim

Reputation: 26078

Assuming you know how to aggregate your data to be like the following:

with data (user_id, item_name, click_count) as (
    select * from values
        ('user1', 'item1', 5),
        ('user1', 'item2', 10),
        ('user1', 'item3', 20),
        
        ('user2', 'item1', 10),
        ('user2', 'item2', 1),
        ('user2', 'item3', 1),

        ('user3', 'item2', 1),
        
        ('user4', 'item2', 10),
        ('user4', 'item2', 2)
)

the results you are after can be done with ARRAY_AGG and OBJECT_CONSTRUCT

select 
     user_id, 
     array_agg(object_construct(item_name, click_count)) within group (order by click_count desc) as clicked_items   
from data
group by 1
order by 1;

gives:

USER_ID CLICKED_ITEMS
user1 [ { "item3": 20 }, { "item2": 10 }, { "item1": 5 } ]
user2 [ { "item1": 10 }, { "item2": 1 }, { "item3": 1 } ]
user3 [ { "item2": 1 } ]
user4 [ { "item2": 10 }, { "item2": 2 } ]

or if you want it to look "exactly" like you have shown, then the second option can be done with two LISTAGG

select 
     user_id, 
     listagg(item_name, ', ') within group (order by click_count desc) as clicked_items,
     listagg(click_count, ', ') within group (order by click_count desc) as num_clicks 
from data
group by 1
order by 1;

giving:

USER_ID CLICKED_ITEMS NUM_CLICKS
user1 item3, item2, item1 20, 10, 5
user2 item1, item2, item3 10, 1, 1
user3 item2 1
user4 item2, item2 10, 2

well, the first output can be created with string concatenations also.

select 
     user_id, 
     listagg(item_name||': '||click_count, ', ') within group (order by click_count desc) as clicked_items   
from data
group by 1
order by 1;

gives:

USER_ID CLICKED_ITEMS
user1 item3: 20, item2: 10, item1: 5
user2 item1: 10, item2: 1, item3: 1
user3 item2: 1
user4 item2: 10, item2: 2

Upvotes: 1

Related Questions