David
David

Reputation: 1650

How to do an "Order of Events" query in Hadoop Hive?

I've been learning Hive over the past 2 months, but I'm having trouble figuring out how to do certain sequence based queries. Take this example:

The problem: How would I write a query which asks, "On average, how many times does event A happen before B occurs"?

I know how to group the users and only take users which have done A, and B and average the number of A which occurred, but limiting by the first occurrence of B seems like it would be difficult. I think I might actually be able to do this by stringing together 10 or so nasty looking queries, but I was wanting to know if there was an easier way to do it that I don't know of.

Thanks!

Upvotes: 0

Views: 298

Answers (1)

David Gruzman
David Gruzman

Reputation: 8088

This is not something easy to map to SQL, especially in Hive, taking into account that Hive do not support joins different from equi-joins.
In the same time we can leave a lot of heavy lifting to Hive, while doing the small part by ourselves.
Idea I would suggest is: let hive to group together all events for the specific user. We can create our own agregation UDF which will calculate min date for A and B and give output 0,1,2,3 - if only A happens, Only B Happens, A happens before B , B happens before A.
From there the rest of calculations is again trivial to do in Hive.

Upvotes: 2

Related Questions