Reputation: 61
I have an enormous clickstream data which I can pull the timestamp, user_ids corresponding item_ids that customers clicked.
I am trying to create a dataset that will have the unique customer ids (user_id) on each row and another column that has key:value pair that provides me the list of items and number of times corresponding item is clicked like a python dictionary (or this can be in two columns with items and corresponding num. clicks as a sorted list from most frequent to least). The number of clicks for an item is the times that item_id occur for a fixed user_id. Basically, every occurrence of item is one click. If it occurs 10 times for a fixed user_id, that means that click_count = 10 for that user_id, item_id pair.
Here is an example:
USER_ID | CLICKED_ITEMS |
---|---|
user1 | item3: 20, item2: 10, item1: 5 |
user2 | item1: 10, item2: 1, item3: 1 |
user3 | item2: 1 |
user4 | item1: 10, item2: 2 |
or
USER_ID | CLICKED_ITEMS | NUM_CLICKS |
---|---|---|
user1 | item3, item2, item1 | 20, 10, 5 |
user2 | item1, item2, item3 | 10, 1, 1 |
user3 | item2 | 1 |
user4 | item1, item2 | 0, 2 |
Is this possible using SQL (esp. for Snowflake syntax)? TIA
I tried
SELECT user_id, using ARRAY_UNIQUE_AGG(item_id) AS clicked_items
FROM clickstream_table
GROUP_BY user_id
and got unique user_ids and corresponding list of items but not sure how to find number of times each item is clicked. I can do this easily using pandas but unfortunately I cannot even pull 1 month of data due to size and timeout limitations (3hr limit) on warehouse.
Upvotes: 0
Views: 746
Reputation: 26078
Assuming you know how to aggregate your data to be like the following:
with data (user_id, item_name, click_count) as (
select * from values
('user1', 'item1', 5),
('user1', 'item2', 10),
('user1', 'item3', 20),
('user2', 'item1', 10),
('user2', 'item2', 1),
('user2', 'item3', 1),
('user3', 'item2', 1),
('user4', 'item2', 10),
('user4', 'item2', 2)
)
the results you are after can be done with ARRAY_AGG and OBJECT_CONSTRUCT
select
user_id,
array_agg(object_construct(item_name, click_count)) within group (order by click_count desc) as clicked_items
from data
group by 1
order by 1;
gives:
USER_ID | CLICKED_ITEMS |
---|---|
user1 | [ { "item3": 20 }, { "item2": 10 }, { "item1": 5 } ] |
user2 | [ { "item1": 10 }, { "item2": 1 }, { "item3": 1 } ] |
user3 | [ { "item2": 1 } ] |
user4 | [ { "item2": 10 }, { "item2": 2 } ] |
or if you want it to look "exactly" like you have shown, then the second option can be done with two LISTAGG
select
user_id,
listagg(item_name, ', ') within group (order by click_count desc) as clicked_items,
listagg(click_count, ', ') within group (order by click_count desc) as num_clicks
from data
group by 1
order by 1;
giving:
USER_ID | CLICKED_ITEMS | NUM_CLICKS |
---|---|---|
user1 | item3, item2, item1 | 20, 10, 5 |
user2 | item1, item2, item3 | 10, 1, 1 |
user3 | item2 | 1 |
user4 | item2, item2 | 10, 2 |
well, the first output can be created with string concatenations also.
select
user_id,
listagg(item_name||': '||click_count, ', ') within group (order by click_count desc) as clicked_items
from data
group by 1
order by 1;
gives:
USER_ID | CLICKED_ITEMS |
---|---|
user1 | item3: 20, item2: 10, item1: 5 |
user2 | item1: 10, item2: 1, item3: 1 |
user3 | item2: 1 |
user4 | item2: 10, item2: 2 |
Upvotes: 1