Mikhail
Mikhail

Reputation: 1346

HiveQL: find Nth value in a one-to-many table

I have a simple table:

UserID  EventTimestamp
...     ...

User can have any number of events (from none to lots). I need to run a HiveQL query that would extract the timestamp of the 3rd event (in the ascending order). Having fewer than 3 events should result in null.

Is there a solution that doesn't require emulating rank() using a UDF?

Upvotes: 1

Views: 319

Answers (2)

spfister
spfister

Reputation: 21

This should work. The inner query uses DISTRIBUTE BY and SORT BY to sort the timestamps by userid; then times are popped into an array via collect_set. Then get the 3rd element.

I couldn't make collect_set behave inside sub1, so I structured it thisaway, although I suspect there must be a more efficient way.

    SELECT userid, time[2] FROM
(SELECT userid, collect_set(eventtimestamp) time FROM
(SELECT userid , eventtimestamp 
FROM myTable
DISTRIBUTE BY userid
SORT BY eventtimestamp ASC 
LIMIT 3) sub1
GROUP BY userid ) sub2
GROUP BY userid, time[2]

Upvotes: 1

fthiella
fthiella

Reputation: 49049

If you can use a MySQL query, you could use something like this:

SELECT
  e1.UserID,
  MIN(e3.EventTimestamp) ThirdTimestamp
FROM
  Events e1 LEFT JOIN Events e2
  ON e1.UserID=e2.UserID and e1.EventTimestamp<e2.EventTimestamp
  LEFT JOIN Events e3
  ON e1.UserID=e3.UserID and e2.EventTimestamp<e3.EventTimestamp
GROUP BY
  UserID

Fiddle here.

Upvotes: 2

Related Questions