Ajeet Ganga

Reputation: 8653

PIG how to merge two groups?

What I have

{a, {(1,2),(3,4)}, {(5,6),(7,8)}}

What I want is

{a, {(1,2),(3,4),(5,6),(7,8)}}

I went through the Pig manual but did not find any way of appending two bags together.

Of course, the fallback would be to write a Python UDF, but is there a Pig-provided way to do it?

Upvotes: 0

Views: 1257

Answers (2)

mr2ert

Reputation: 5186

There is no built-in function that does this. You should be able to do it in pure Pig Latin, but it is going to be much slower than a UDF of any kind. You'll have to use FLATTEN and UNION like this:

-- A: {key: chararray, vals1: {(one:int, two:int)}, vals2: {(one:int, two:int)}}

B = FOREACH A GENERATE key, FLATTEN(vals1) ;
C = FOREACH A GENERATE key, FLATTEN(vals2) ;

D = UNION B, C ;

-- Group and filter out 'key' from the result bag.
E = FOREACH (GROUP D BY key)
    GENERATE group AS key, D.(one, two) AS joined_bag ;

Notice how much uglier this is than a simple python UDF written like:

# Make sure to include the appropriate outputSchema
def join_bags(BAG1, BAG2):
    return BAG1 + BAG2

And used like:

B = FOREACH A GENERATE key, pythonUDFs.join_bags(vals1, vals2) ;

This would be much simpler if UNION was allowed in nested FOREACHs, but sadly it is not.
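
For reference, here is a slightly fuller sketch of the Python UDF, not a definitive implementation. It assumes the script is registered with Jython, where Pig makes the outputSchema decorator available and passes each bag in as a list of tuples. The schema string, the None handling for null bags, and the no-op fallback decorator (there only so the file also runs as plain Python for a quick check) are my assumptions, so adjust them to your data:

```python
# Assumption: under Pig's Jython engine, outputSchema is already defined.
# Outside Pig it is not, so fall back to a no-op stand-in for local testing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


# Assumed schema, matching the (one:int, two:int) tuples used above.
@outputSchema("joined_bag:{(one:int, two:int)}")
def join_bags(bag1, bag2):
    # Treat a missing (None) bag as empty so the concatenation never fails.
    return (bag1 or []) + (bag2 or [])
```

Called on plain Python lists of tuples, join_bags returns the first bag's tuples followed by the second's, which is the shape asked for in the question.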

Upvotes: 1

matterhayes

Reputation: 458

Check out the BagConcat UDF from DataFu. It does exactly what you want.

Example from the documentation:

define BagConcat datafu.pig.bags.BagConcat();
-- This example illustrates the use on a tuple of bags

-- input:
-- ({(1),(2),(3)},{(3),(4),(5)})
-- ({(20),(25)},{(40),(50)})
input = LOAD 'input' AS (A: bag{T: tuple(v:INT)}, B: bag{T: tuple(v:INT)});

-- output:
-- ({(1),(2),(3),(3),(4),(5)})
-- ({(20),(25),(40),(50)})
output = FOREACH input GENERATE BagConcat(A,B); 

Upvotes: 2
