Reputation: 3752
Hi, I would like to know how to implement lookup logic in Hadoop Pig. I have a set of records for a weblog user and need to go back and fetch some fields from that user's first visit (not the current one).
This is doable in Java, but is there a way to implement it in Pig?
Example:
Suppose that, for one particular user identified by col1 and col2, I want to output the first value for that user in lookup_col, in this case '1'.
col1 col2 lookup_col
---- ---- -----
326 8979 1
326 8979 4
326 8979 3
326 8979 0
Upvotes: 2
Views: 760
Reputation: 1177
You can implement this as a Pig UDF.
Alternatively, you can use simple SQL-like logic: aggregate the visits by user and keep only the first one, then left-join the users with agg_visits. (I'm not sure how you define a user or how you plan to look up visits by user, but that's another matter.)
A 'replicated join' in Pig is essentially a lookup against a small relation that is shipped to every node and loaded into memory. However, because it is a JOIN and not a true lookup, it can return more than one row per key. If you aggregate the data beforehand, you make sure there is only a single record per key.
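As a rough sketch of the aggregate-then-join approach, something along these lines might work. Note the assumptions: the file name weblog.tsv, the visit_ts column, and the relation and alias names are all hypothetical. The sample data in the question has no column that defines which visit came "first", so an ordering field such as a timestamp is assumed here.

```pig
-- Assumed input: tab-separated weblog; visit_ts is a hypothetical ordering column.
visits = LOAD 'weblog.tsv' USING PigStorage('\t')
         AS (col1:int, col2:int, visit_ts:long, lookup_col:int);

-- One group per user, where a user is identified by (col1, col2).
by_user = GROUP visits BY (col1, col2);

-- Within each group, sort by visit time and keep only the earliest row,
-- so that agg_visits has exactly one record per key.
agg_visits = FOREACH by_user {
    ordered  = ORDER visits BY visit_ts ASC;
    earliest = LIMIT ordered 1;
    GENERATE FLATTEN(group) AS (col1, col2),
             FLATTEN(earliest.lookup_col) AS first_lookup_col;
};

-- Attach the first-visit value to every record of that user.
-- 'replicated' loads the right-hand relation (agg_visits) into memory,
-- so this only suits cases where agg_visits is small enough to fit.
joined = JOIN visits BY (col1, col2) LEFT OUTER,
              agg_visits BY (col1, col2) USING 'replicated';

result = FOREACH joined GENERATE
             visits::col1, visits::col2, visits::lookup_col,
             agg_visits::first_lookup_col;
```

If agg_visits is too large to hold in memory, dropping USING 'replicated' falls back to Pig's default join.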
Upvotes: 1