Jon Lawton

Reputation: 900

Apache Pig: Filter one tuple on another?

I want to write a Pig script that splits a relation into two (or whatever they're called in Pig) based on a criterion in col2. After manipulating col2 into another column in each, I want to compare the two manipulated relations and do an additional exclude.

REGISTER /home/user1/piggybank.jar;

log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);

--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;

is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;

Splitting and manipulating is the easy part. This is where it gets complicated:

merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};

If I can figure out this line (or lines), the rest will fall into place.

merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;

STORE merge_limited into 'file';

Here's an example of the I/O:

col1                col2            manipulated
This                qWerty          W
Is                  qweRty          R
An                  qwertY          Y
Example             qwErty          E
Of                  qwerTy          T
Example             Qwerty          Q
Data                qWerty          W


isnt
E
Y


col1                col2
This                qWerty
Is                  qweRty
Of                  qwerTy
Example             Qwerty
Data                qWerty

Upvotes: 0

Views: 3794

Answers (1)

reo katoa

Reputation: 5801

I'm still not quite sure what you need, but I believe you can reproduce your input and output with the following (untested):

data = LOAD 'input' AS (col1:chararray, col2:chararray);
exclude = LOAD 'exclude' AS (excl:chararray);

m = FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated;
test = COGROUP m BY manipulated, exclude BY excl;

-- Here you can choose IsEmpty or NOT IsEmpty according to whether you want to exclude or include
final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);

With the COGROUP, you group all tuples in each relation by the grouping key. If the bag of tuples from exclude is empty, it means that the grouping key was not present in the exclude list, so you keep tuples from m with that key. Conversely, if the grouping key was present in exclude, that bag will not be empty and the tuples from m with that key will be filtered out.
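
Applied to the relations in the question, the same pattern might look like the sketch below. It assumes that the exclude list is the question's is_distinct relation and that the rows to keep come from isnt_generated; the name grouped is only illustrative, and like the code above this is untested.

```pig
-- Sketch only: reuses isnt_generated and is_distinct as defined in the question.
-- Group both relations on their manipulated value.
grouped = COGROUP isnt_generated BY isnt_manipulated, is_distinct BY is_manipulated;

-- Keep only the groups whose key never appears in is_distinct,
-- then unpack the surviving isnt_generated tuples.
merge_filtered = FOREACH (FILTER grouped BY IsEmpty(is_distinct)) GENERATE FLATTEN(isnt_generated);
```

After the FLATTEN, the fields carry an isnt_generated:: prefix, but the short names (random, col2, col1) should still resolve in the later ORDER as long as they are unambiguous.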

Upvotes: 2
