ubuntunoob
ubuntunoob

Reputation: 249

Hive QL - select based on condition, grouped on one row

I have a date partitioned Hive table that has one row for every user. It has a column activity_log that has the value either 1 or 0, depending on whether user performed that activity on that date.

I also have a UDF, something like dayOfWeek(), that gives me the day of the week given a date.

I'm trying to create a table that contains user activity for the past one week. So, the columns would be :

user, activity_log_mon, activity_log_tue, activity_log_wed, ...activity_log_sun

Each of the activity_log columns should have the value 1 or 0, indicating whether or not the user performed that activity on that day of the past week.

This is a query that gives me almost what I want :

SELECT user,
IF(dayOfWeek(date)='sun', activity_log , NULL) as activity_log_sun,
IF(dayOfWeek(date)='mon', activity_log , NULL) as activity_log_mon,
IF(dayOfWeek(date)='tue', activity_log , NULL) as activity_log_tue,
IF(dayOfWeek(date)='wed', activity_log , NULL) as activity_log_wed,
IF(dayOfWeek(date)='thu', activity_log , NULL) as activity_log_thu,
IF(dayOfWeek(date)='fri', activity_log , NULL) as activity_log_fri,
IF(dayOfWeek(date)='sat', activity_log , NULL) as activity_log_sat
FROM user_activity_table
WHERE date >= '2015-08-18' AND date <= '2015-08-24'

But this gives 7 rows per user, as follows :

user    activity_log_sun    activity_log_mon   ....   activity_log_sat

abcd          1                  NULL                      NULL 
abcd         NULL                 0                        NULL
... 
abcd         NULL                NULL                       1

What I actually want is a table having only one row for every user, as follows :

user    activity_log_sun    activity_log_mon   ....   activity_log_sat

abcd          1                   0                         1

How can I regroup the rows like this? Or, what's the best way to get the rows like this in the first place?

Upvotes: 1

Views: 3251

Answers (2)

ubuntunoob
ubuntunoob

Reputation: 249

Here is what I ended up doing :

SELECT user, 
       SUM(activity_log_sun),
       SUM(activity_log_mon),
       SUM(activity_log_tue),
       SUM(activity_log_wed),
       SUM(activity_log_thu),
       SUM(activity_log_fri),
       SUM(activity_log_sat)
FROM ( 
SELECT user,
IF(dayOfWeek(date)='sun', activity_log , NULL) as activity_log_sun,
IF(dayOfWeek(date)='mon', activity_log , NULL) as activity_log_mon,
IF(dayOfWeek(date)='tue', activity_log , NULL) as activity_log_tue,
IF(dayOfWeek(date)='wed', activity_log , NULL) as activity_log_wed,
IF(dayOfWeek(date)='thu', activity_log , NULL) as activity_log_thu,
IF(dayOfWeek(date)='fri', activity_log , NULL) as activity_log_fri,
IF(dayOfWeek(date)='sat', activity_log , NULL) as activity_log_sat
FROM user_activity_table
WHERE date >= '2015-08-18' AND date <= '2015-08-24'
) t
GROUP BY user

Upvotes: 1

mattinbits
mattinbits

Reputation: 10428

Observe the behaviour of the following HiveQL:

SELECT COALESCE(collected[0], collected[1], collected[2], collected[3]) 
FROM(Select Array(NULL, 1, NULL, NULL) as collected) a;

This returns 1 as the first non-null value to the COALESCE function. Then see there is a grouping function collect_list(col).

So, if we call your output with 7 rows per user as activity_uncollected, your final transformation would be:

SELECT user_id,
  COALESCE(collected_mon[0], collected_mon[1], ..., collected_mon[6]),
  ...
  COALESCE(collected_sun[0], collected_sun[1], ..., collected_sun[6])
FROM
  (SELECT user_id, 
     collect_list(activity_log_mon), 
     ..., 
     collect_list(activity_log_sun)
   FROM activity_uncollected 
   GROUP BY user_id) a;

This groups all the values per-user, per-day, then picks the non-null value from each array.

Upvotes: 0

Related Questions