Reputation: 1985
In my Hive table, I have the following:
mydate,user_id,source,timestamp
20160901,10293,google,2016-09-01 00:21:06
20160901,34393,yelp,2016-09-01 05:23:00
20160901,38437,fb,2016-09-01 12:03:24
20160902,34393,google,2016-09-02 13:44:55
...
The source is the user's referral location (yelp, google, fb, bing and so on) and determines what kind of account the user has. Each user_id identifies exactly one user, and I'm trying to find the distribution of sources over the past year.
I could do something like:
sum(IF(source = 'fb',1,0)) fb_count,
sum(IF(source = 'yelp',1,0)) yelp_count
... from my_table where mydate between blah and foo.
However, the complication is that users can switch account types. For instance, a user can switch from (say) google to bing after the account is created, so the same user_id can appear in several rows with different sources. To handle this, I need to pick, for each user_id, the source on the row with the earliest timestamp, which corresponds to the time the account was created.
Essentially: count users per source, using only the row with the lowest timestamp for each user_id.
The result would look like:
month,fb_count,google_count,yelp_count,bing_count
201601,1667,3403,304,4340
201602,367,343,34,434
...
I could use a self join but that would be pretty expensive. Is there a better way?
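For comparison, the self-join version might look roughly like the sketch below. It assumes the table is called my_table, that mydate is stored as yyyyMMdd, and that timestamp has to be backtick-quoted because it is a reserved word in recent Hive versions; the aliases t, f and first_ts are made up for illustration.

```sql
-- Sketch only: aliases t, f and first_ts are illustrative, not from the real schema.
select substr(cast(t.mydate as string), 1, 6) as month,
       sum(if(t.source = 'fb', 1, 0))     as fb_count,
       sum(if(t.source = 'google', 1, 0)) as google_count,
       sum(if(t.source = 'yelp', 1, 0))   as yelp_count,
       sum(if(t.source = 'bing', 1, 0))   as bing_count
from my_table t
join (select user_id, min(`timestamp`) as first_ts  -- earliest row per user
      from my_table
      group by user_id) f
  on t.user_id = f.user_id
 and t.`timestamp` = f.first_ts
group by substr(cast(t.mydate as string), 1, 6);
```

This reads my_table twice, and a user with two rows sharing the same earliest timestamp would be counted twice.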
Upvotes: 0
Views: 103
Reputation: 49260
You can get each user's first registered row using the row_number window function and then use conditional aggregation.
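select substr(cast(mydate as string), 1, 6) as month, -- yyyyMM, assuming mydate is stored as yyyyMMdd
       sum(if(source = 'fb', 1, 0)) as fb_count,
       sum(if(source = 'google', 1, 0)) as google_count,
       sum(if(source = 'yelp', 1, 0)) as yelp_count,
       sum(if(source = 'bing', 1, 0)) as bing_count
from (select t.*,
             -- rn = 1 marks each user's earliest row; timestamp is backtick-quoted because it is a reserved word
             row_number() over (partition by user_id order by `timestamp`) as rn
      from tablename t) x
where rn = 1 -- add a date filter here if necessary (not in the subquery, or rn = 1 no longer means the account's first row)
group by substr(cast(mydate as string), 1, 6)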
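To see the effect on the four sample rows from the question, here is a small self-contained sketch. The CTE name sample is made up, mydate and timestamp are held as strings purely for the demo, and it assumes a Hive version that accepts CTEs and SELECT without FROM.

```sql
-- Sketch only: inline copies of the sample rows from the question.
with sample as (
  select '20160901' as mydate, 10293 as user_id, 'google' as source,
         '2016-09-01 00:21:06' as `timestamp`
  union all
  select '20160901', 34393, 'yelp', '2016-09-01 05:23:00'
  union all
  select '20160901', 38437, 'fb', '2016-09-01 12:03:24'
  union all
  select '20160902', 34393, 'google', '2016-09-02 13:44:55'
)
select substr(mydate, 1, 6) as month,
       sum(if(source = 'fb', 1, 0))     as fb_count,
       sum(if(source = 'google', 1, 0)) as google_count,
       sum(if(source = 'yelp', 1, 0))   as yelp_count,
       sum(if(source = 'bing', 1, 0))   as bing_count
from (select s.*,
             -- number each user's rows from earliest to latest
             row_number() over (partition by user_id order by `timestamp`) as rn
      from sample s) x
where rn = 1  -- keep only the earliest row per user
group by substr(mydate, 1, 6);
```

User 34393 appears twice, but only the earlier yelp row has rn = 1, so the later google row is not counted.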
Upvotes: 1