I want to run a Pig script that splits a relation into two sets of tuples (or whatever they're called in Pig) based on criteria in col2. After manipulating col2 into another column, I want to compare the two manipulated sets and apply an additional exclusion.
REGISTER /home/user1/piggybank.jar;
log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);
--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;
is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;
Splitting and manipulating is the easy part. This is where it gets complicated:
merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};
If I can figure out this line (or lines), the rest should fall into place.
merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;
STORE merge_limited into 'file';
Here's an example of the I/O:
col1     col2    manipulated
This     qWerty  W
Is       qweRty  R
An       qwertY  Y
Example  qwErty  E
Of       qwerTy  T
Example  Qwerty  Q
Data     qWerty  W

isnt
E
Y

col1     col2
This     qWerty
Is       qweRty
Of       qwerTy
Example  Qwerty
Data     qWerty
Upvotes: 0
I'm still not sure quite what you need, but I believe you can reproduce your input and output with the following (untested):
data = LOAD 'input' AS (col1:chararray, col2:chararray);
exclude = LOAD 'exclude' AS (excl:chararray);
m = FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated;
test = COGROUP m BY manipulated, exclude BY excl;
-- Here you can choose IsEmpty or NOT IsEmpty according to whether you want to exclude or include
final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);
With COGROUP, you group all tuples in each relation by the grouping key. If the bag of tuples from exclude is empty, it means that the grouping key was not present in the exclude list, so you keep tuples from m with that key. Conversely, if the grouping key was present in exclude, that bag will not be empty and the tuples from m with that key will be filtered out.
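To make the COGROUP-then-filter logic concrete, here is a rough emulation in plain Python rather than Pig, using the example rows from the question. The `cogroup` helper is a hypothetical stand-in written for this sketch, not a Pig or library API, and it only approximates Pig's behaviour (for instance, it ignores null keys). It is meant to show why the example output comes out as it does, not to replace the script.

```python
from collections import defaultdict

def cogroup(left, left_key, right, right_key):
    """Hypothetical helper: map each key to (bag of left rows, bag of right rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[left_key(row)][0].append(row)
    for row in right:
        groups[right_key(row)][1].append(row)
    return groups

# Example rows from the question: (col1, col2, manipulated)
m = [
    ("This", "qWerty", "W"),
    ("Is", "qweRty", "R"),
    ("An", "qwertY", "Y"),
    ("Example", "qwErty", "E"),
    ("Of", "qwerTy", "T"),
    ("Example", "Qwerty", "Q"),
    ("Data", "qWerty", "W"),
]
# Values to exclude, one single-field tuple per row
exclude = [("E",), ("Y",)]

# test = COGROUP m BY manipulated, exclude BY excl;
test = cogroup(m, lambda r: r[2], exclude, lambda r: r[0])

# final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);
final = [row for m_bag, excl_bag in test.values() if not excl_bag for row in m_bag]

for col1, col2, _ in final:
    print(col1, col2)
```

The rows whose manipulated value is E or Y end up in groups with a non-empty exclude bag, so they are dropped. Row order is not meaningful here, just as Pig makes no ordering guarantee without an explicit ORDER.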
Upvotes: 2