Reputation: 177
I have a table that looks like:
usr_id query_ts
12345 2019/05/13 02:06
123444 2019/05/15 04:06
123444 2019/05/16 05:06
12345 2019/05/16 02:06
12345 2019/05/15 02:06
It contains a user ID and the time at which that user ran a query. Each row in the table represents that ID running one query at the given timestamp.
I am trying to produce this:
usr_id day_1 day_2 … day_30
12345 31 13 15
123444 23 41 14
I would like to show the number of queries run on each day of the last 30 days for each ID; if no query was run on a given day, the value should be 0.
Here is a portion of the query I came up with:
SELECT
t1.usr_id,
case when t1.count_day_1 is null then 0 else t1.count_day_1 end as day_1,
case when t2.count_day_2 is null then 0 else t2.count_day_2 end as day_2
FROM
(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_1,
COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_1
FROM db.table
WHERE
DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 1
AND
from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
AND from_unixtime(unix_timestamp())
GROUP BY usr_id, day_1) t1
LEFT JOIN
(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_2,
COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_2
FROM db.table
WHERE
DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 2
AND
from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
AND from_unixtime(unix_timestamp())
GROUP BY usr_id, day_2) t2
ON (t1.usr_id = t2.usr_id)
ORDER BY t1.usr_id;
This works great: it shows the number of queries run on each of the first 2 days, and it replaces the NULLs with 0s.
The problem is that to get this working for all 30 days I have to use 30 LEFT JOINs, which pulls 400GB+ of memory on the cluster.
Is there an easier way to do this?
Upvotes: 2
Views: 46
Reputation: 38325
Try doing it without joins, and use the current_date or current_timestamp constants in the WHERE clause instead of unix_timestamp() with no arguments. That function is not deterministic and its value is not fixed for the scope of a query execution, which prevents proper optimization of queries. It has been deprecated since 2.0 in favour of the CURRENT_TIMESTAMP constant:
select usr_id,
nvl(count(case when from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "dd") = 1 then 1 end),0) as day_1,
nvl(count(case when from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "dd") = 2 then 1 end),0) as day_2
...
from db.table
WHERE
from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
BETWEEN date_sub(current_date, 30) AND current_date
group by usr_id
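To see the conditional-aggregation idea on the sample rows from the question, here is a minimal sketch that runs locally. It is an illustration only, with these assumptions: it uses Python's built-in sqlite3 as a stand-in for Hive, the table name `queries` and the helpers `day_columns` and `pivot` are made up for the example, and the day of month is extracted with SQLite's `substr` rather than the Hive `from_unixtime`/`unix_timestamp` calls. The part that carries over is the pattern: one `COUNT(CASE WHEN ... THEN 1 END)` per day and a single `GROUP BY usr_id`, with no joins.

```python
import sqlite3

# Sample rows from the question: (usr_id, query_ts)
rows = [
    (12345, "2019/05/13 02:06"),
    (123444, "2019/05/15 04:06"),
    (123444, "2019/05/16 05:06"),
    (12345, "2019/05/16 02:06"),
    (12345, "2019/05/15 02:06"),
]


def day_columns(days):
    """Build one conditional COUNT column per day of month (hypothetical helper).

    substr(query_ts, 9, 2) picks the 'dd' part of 'yyyy/MM/dd HH:mm'.
    This is SQLite syntax; in Hive the date expression would differ.
    """
    return ",\n".join(
        f"  COUNT(CASE WHEN CAST(substr(query_ts, 9, 2) AS INTEGER) = {d} "
        f"THEN 1 END) AS day_{d}"
        for d in days
    )


def pivot(rows, days):
    """Load the rows into an in-memory table and pivot them in one pass."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE queries (usr_id INTEGER, query_ts TEXT)")
    conn.executemany("INSERT INTO queries VALUES (?, ?)", rows)
    sql = (
        f"SELECT usr_id,\n{day_columns(days)}\n"
        "FROM queries\nGROUP BY usr_id\nORDER BY usr_id"
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


result = pivot(rows, [13, 14, 15, 16])
for row in result:
    print(row)
```

COUNT skips the NULLs produced when the CASE does not match, so days without queries come out as 0 without any extra NULL handling. Generating the column list in a loop like this also saves typing all 30 expressions by hand.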
Upvotes: 2