Reputation: 1346
I have a simple table:
UserID EventTimestamp
... ...
User can have any number of events (from none to lots). I need to run a HiveQL query that would extract the timestamp of the 3rd event (in the ascending order). Having fewer than 3 events should result in null.
Is there a solution that doesn't require emulating rank() using a UDF?
Upvotes: 1
Views: 319
Reputation: 21
This should work. The inner query uses DISTRIBUTE BY and SORT BY to sort the timestamps by userid; then times are popped into an array via collect_set. Then get the 3rd element.
I couldn't make collect_set behave inside sub1, so I structured it thisaway, although I suspect there must be a more efficient way.
SELECT userid, time[2] FROM
(SELECT userid, collect_set(eventtimestamp) time FROM
(SELECT userid , eventtimestamp
FROM myTable
DISTRIBUTE BY userid
SORT BY eventtimestamp ASC
LIMIT 3) sub1
GROUP BY userid ) sub2
GROUP BY userid, time[2]
Upvotes: 1
Reputation: 49049
If you can use a MySQL query, you could use something like this:
SELECT
e1.UserID,
MIN(e3.EventTimestamp) ThirdTimestamp
FROM
Events e1 LEFT JOIN Events e2
ON e1.UserID=e2.UserID and e1.EventTimestamp<e2.EventTimestamp
LEFT JOIN Events e3
ON e1.UserID=e3.UserID and e2.EventTimestamp<e3.EventTimestamp
GROUP BY
UserID
Fiddle here.
Upvotes: 2