Reputation: 249
I have a date partitioned Hive table that has one row for every user. It has a column activity_log that has the value either 1 or 0, depending on whether user performed that activity on that date.
I also have a UDF, something like dayOfWeek(), that gives me the day of the week given a date.
I'm trying to create a table that contains user activity for the past one week. So, the columns would be :
user, activity_log_mon, activity_log_tue, activity_log_wed, ...activity_log_sun
Each of the activity_log columns should have the value 1 or 0, indicating whether or not the user performed that activity on that day of the past week.
This is a query that gives me almost what I want :
SELECT user,
IF(dayOfWeek(date)='sun', activity_log , NULL) as activity_log_sun,
IF(dayOfWeek(date)='mon', activity_log , NULL) as activity_log_mon,
IF(dayOfWeek(date)='tue', activity_log , NULL) as activity_log_tue,
IF(dayOfWeek(date)='wed', activity_log , NULL) as activity_log_wed,
IF(dayOfWeek(date)='thu', activity_log , NULL) as activity_log_thu,
IF(dayOfWeek(date)='fri', activity_log , NULL) as activity_log_fri,
IF(dayOfWeek(date)='sat', activity_log , NULL) as activity_log_sat
FROM user_activity_table
WHERE date >= '2015-08-18' AND date <= '2015-08-24'
But this gives 7 rows per user, as follows :
user activity_log_sun activity_log_mon .... activity_log_sat
abcd 1 NULL NULL
abcd NULL 0 NULL
...
abcd NULL NULL 1
What I actually want is a table having only one row for every user, as follows :
user activity_log_sun activity_log_mon .... activity_log_sat
abcd 1 0 1
How can I regroup the rows like this? Or, what's the best way to get the rows like this in the first place?
Upvotes: 1
Views: 3251
Reputation: 249
Here is what I ended up doing :
SELECT user,
SUM(activity_log_sun),
SUM(activity_log_mon),
SUM(activity_log_tue),
SUM(activity_log_wed),
SUM(activity_log_thu),
SUM(activity_log_fri),
SUM(activity_log_sat)
FROM (
SELECT user,
IF(dayOfWeek(date)='sun', activity_log , NULL) as activity_log_sun,
IF(dayOfWeek(date)='mon', activity_log , NULL) as activity_log_mon,
IF(dayOfWeek(date)='tue', activity_log , NULL) as activity_log_tue,
IF(dayOfWeek(date)='wed', activity_log , NULL) as activity_log_wed,
IF(dayOfWeek(date)='thu', activity_log , NULL) as activity_log_thu,
IF(dayOfWeek(date)='fri', activity_log , NULL) as activity_log_fri,
IF(dayOfWeek(date)='sat', activity_log , NULL) as activity_log_sat
FROM user_activity_table
WHERE date >= '2015-08-18' AND date <= '2015-08-24'
) t
GROUP BY user
Upvotes: 1
Reputation: 10428
Observe the behaviour of the following HiveQL:
SELECT COALESCE(collected[0], collected[1], collected[2], collected[3])
FROM(Select Array(NULL, 1, NULL, NULL) as collected) a;
This returns 1
as the first non-null value to the COALESCE
function. Then see there is a grouping function collect_list(col)
.
So, if we call your output with 7 rows per user as activity_uncollected
, your final transformation would be:
SELECT user_id,
COALESCE(collected_mon[0], collected_mon[1], ..., collected_mon[6]),
...
COALESCE(collected_sun[0], collected_sun[1], ..., collected_sun[6])
FROM
(SELECT user_id,
collect_list(activity_log_mon),
...,
collect_list(activity_log_sun)
FROM activity_uncollected
GROUP BY user_id) a;
This groups all the values per-user, per-day, then picks the non-null value from each array.
Upvotes: 0