Navneet Kumar

Reputation: 3752

How to implement lookup logic in Hadoop Pig

Hi, I would like to know how to implement lookup logic in Hadoop Pig. I have a set of records, say for a weblog user, and need to go back and fetch some fields from that user's first visit (not the current one).

This is doable in Java, but is there a way to implement it in Hadoop Pig?

Example:

Suppose a particular user is identified by col1 and col2. While traversing that user's records, I want to output the first value of lookup_col for that user, in this case '1'.

col1  col2  lookup_col
----  ----  -----
326   8979    1
326   8979    4
326   8979    3
326   8979    0

Upvotes: 2

Views: 760

Answers (1)

SNeumann

Reputation: 1177

You can implement this as a Pig UDF.

Alternatively, you can use simple SQL-like logic: aggregate the visits by user, keep the first one, and then left-join users with agg_visits. (I'm not sure how you define a user or how you plan to look up visits by user, but that's another matter.)

A 'replicated join' in Pig is essentially a lookup in a set that is distributed among the nodes and loaded into memory. However, because it is a JOIN operation and not a true lookup, you can get more than one result per key. If you aggregate the data beforehand, you make sure there is only a single record per key.
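
As a rough sketch of the aggregate-then-join approach, something like the following might work. Note that the table in the question has no column that defines visit order, so the `visit_ts` field here is an assumption, as are the file name `visits.tsv`, the tab delimiter, and the column types. Without some ordering column, "first" is not well defined, because Pig relations are unordered.

```pig
-- Hypothetical input: visit_ts is an assumed ordering column that is
-- not present in the question's sample data.
visits = LOAD 'visits.tsv' USING PigStorage('\t')
         AS (col1:int, col2:int, visit_ts:long, lookup_col:int);

-- Aggregate: one record per user key, holding lookup_col from the earliest visit.
by_user = GROUP visits BY (col1, col2);
agg_visits = FOREACH by_user {
    ordered  = ORDER visits BY visit_ts ASC;
    earliest = LIMIT ordered 1;
    GENERATE FLATTEN(group) AS (col1, col2),
             FLATTEN(earliest.lookup_col) AS first_lookup;
};

-- Look up: the relation listed last in a replicated join is the one
-- loaded into memory, so agg_visits must be small enough to fit.
joined = JOIN visits BY (col1, col2) LEFT OUTER,
              agg_visits BY (col1, col2) USING 'replicated';

result = FOREACH joined GENERATE
             visits::col1, visits::col2,
             visits::lookup_col, agg_visits::first_lookup;
```

Because `agg_visits` has exactly one row per (col1, col2), the join returns one row for each row of `visits`, which is what makes it behave like a lookup. If the aggregated relation is too large for memory, dropping `USING 'replicated'` falls back to Pig's default join.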

Upvotes: 1
