TheSoldier

Reputation: 482

How can I use DISTINCT on a group of bags in Pig?

My input looks like this:

({(Fish M.),(Fish M.),(Fish M.),(Fish M.),(Fish M.)},{(Acasuso J.),(Acasuso J.),(Acasuso J.),(Acasuso J.),(Acasuso J.)},{(8/23/2007),(8/23/2007),(8/23/2007),(8/23/2007),(8/23/2007)},{(99.84002222685783),(58.173357215875676),(PSL),(41.66666501098216),(EXW)})

I would like to apply DISTINCT to the first and second bags so that each yields a single value, producing output like this:

(Fish M., Acasuso J., 8/23/2007, 99.84002222685783, 58.173357215875676, PSL, 41.66666501098216, EXW)

Upvotes: 0

Views: 284

Answers (1)

mat77

Reputation: 436

This script should work. I have ignored the last bag in your input for brevity.

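-- CustomLoadFunction() stands in for a loader that can parse the bag syntax of the input
rr = load 'data/pig/input/Pig_DataSets/six' using CustomLoadFunction() as (one:bag{tup1:(c1:chararray)},two:bag{tup2:(c2:chararray)},three:bag{tup3:(c3:chararray)});
tt = foreach rr {
    -- distinct is applied to an alias inside the nested block
    mm = two;
    nn = distinct mm;
    oo = one;
    pp = distinct oo;
    -- output order: third bag (unchanged), distinct first bag, distinct second bag
    generate three,pp,nn;
};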

You might have to use a custom load function because the default loader won't work (unless you do some data cleansing). This post talks about a custom loader that might fit your scenario.
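
To get from the distinct bags to the flat tuple the question asks for, one option is to flatten each de-duplicated bag in the generate clause. The following is a rough sketch rather than a tested solution: it reuses the path, schema, and placeholder CustomLoadFunction() from the script above, also de-duplicates the third bag, and the output field names (player1, player2, match_date) are made up for illustration.

```pig
-- Sketch only: CustomLoadFunction() is the placeholder loader from the script above;
-- the field names player1, player2 and match_date are illustrative assumptions.
rr = load 'data/pig/input/Pig_DataSets/six' using CustomLoadFunction()
     as (one:bag{tup1:(c1:chararray)},
         two:bag{tup2:(c2:chararray)},
         three:bag{tup3:(c3:chararray)});

flat = foreach rr {
    -- same alias-then-distinct pattern as the script above
    oo = one;
    pp = distinct oo;
    mm = two;
    nn = distinct mm;
    qq = three;
    ss = distinct qq;
    -- flatten turns each bag into top-level fields of the output row
    generate flatten(pp) as player1, flatten(nn) as player2, flatten(ss) as match_date;
};

dump flat;
```

If every bag collapses to a single tuple after distinct, as in the sample record, this should give one row per input record. If a bag still holds several distinct values, flatten produces one row per combination of values, so the row count grows accordingly.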

Upvotes: 2
