grasshopper

Reputation: 4068

PIG - filtering groupby by the contents of the group

I am new to Pig, and I am wondering whether it can easily filter groups based on their contents. My data is grouped by userid, and each group holds a set of timestamps. I want to keep only the groups that contain two consecutive timestamps less than 30 minutes apart. Is this easy to express in Pig?

Thanks a lot!

Upvotes: 0

Views: 155

Answers (1)

reo katoa

Reputation: 5801

The cleanest way to do this would be to write a UDF. The function would take a bag of timestamps as input, order them, and compute the minimum difference between timestamps. You could then filter your data based on the output of this UDF.
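As a rough sketch of what such a UDF could compute, here is the core logic in plain Python. The function name `min_gap_seconds` is made up for illustration, and the code assumes the timestamps are numeric (for example, epoch seconds). Pig's Jython bridge hands a bag to Python as a list of tuples, so the function unwraps single-field tuples; it also accepts bare numbers so it can be tried outside Pig.

```python
def min_gap_seconds(timestamps):
    """Return the smallest gap between consecutive timestamps.

    Returns None when there are fewer than two timestamps, since no
    gap exists in that case.
    """
    # Unwrap single-field tuples (how a Pig bag would arrive); pass bare
    # numbers through unchanged. Sorting makes neighbours consecutive in time.
    values = sorted(t[0] if isinstance(t, tuple) else t for t in timestamps)
    if len(values) < 2:
        return None
    # Compare each timestamp with the one that follows it.
    return min(later - earlier for earlier, later in zip(values, values[1:]))


if __name__ == "__main__":
    # Hypothetical bag for one user: three timestamps in epoch seconds.
    print(min_gap_seconds([(0,), (5000,), (1000,)]))
```

To use something like this from Pig, you would typically register the file as a Jython UDF, declare its return type with Pig's `@outputSchema` decorator, and then filter on the result being non-null and below 1800 seconds. Treat those details as assumptions to check against the Pig version you are running.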

It is possible to do this in pure Pig Latin if you really want to, although it involves more temporary data and more map-reduce cycles, so it may not be worth it. You would FLATTEN the bag of timestamps twice to get its cross-product, create an indicator variable that is 1 for any pair of distinct timestamps separated by less than 30 minutes (a timestamp paired with itself must be excluded, since that difference is always zero), and then sum this variable for each user. Any user with a sum greater than zero has the property you want.
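To make the data flow concrete, here is a small plain-Python model of that cross-product approach. It is not Pig Latin, only an illustration of the same steps; the function name `users_with_close_pair` and the `(userid, timestamp)` row layout are assumptions, with timestamps taken to be epoch seconds. Checking all pairs gives the same answer as checking only consecutive ones, because if any two timestamps are less than 30 minutes apart, the neighbours lying between them in sorted order are at least as close.

```python
from itertools import product

THRESHOLD_SECONDS = 30 * 60  # 30 minutes


def users_with_close_pair(rows, threshold=THRESHOLD_SECONDS):
    """Return userids that have two timestamps less than `threshold` apart.

    rows: iterable of (userid, timestamp) pairs, timestamps in seconds.
    """
    # Collect timestamps per user (models grouping by userid).
    by_user = {}
    for userid, ts in rows:
        by_user.setdefault(userid, []).append(ts)

    selected = []
    for userid, stamps in by_user.items():
        # Cross-product of the bag with itself (models flattening it twice).
        # Keeping only a < b drops self-pairs and mirrored pairs; note that
        # this also treats exact duplicate timestamps as a single event.
        indicators = [
            1 if b - a < threshold else 0
            for a, b in product(stamps, stamps)
            if a < b
        ]
        # A positive sum means at least one close pair exists.
        if sum(indicators) > 0:
            selected.append(userid)
    return selected


if __name__ == "__main__":
    # Hypothetical sample data.
    sample = [("u1", 0), ("u1", 1000), ("u1", 5000), ("u2", 0), ("u2", 4000)]
    print(users_with_close_pair(sample))
```

The quadratic blow-up in the cross-product is the "more temporary data" cost mentioned above: a user with n timestamps produces n × n pairs before filtering.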

Give it a go, and if you run into any specific issues, post another question outlining exactly where you're stuck.

Upvotes: 1
