David Parks
David Parks

Reputation: 32051

Pig: How to join data on a key in a nested bag

I'm simply trying to merge in the values from data2 to data1 on the 'value1'/'value2' keys seen in both data1 and data2 (note the nested structure of

Easy right? In object oriented code it's a nested for loop. But in Pig it feels like solving a rubix cube.

data1 = 'item1'     111     { ('thing1', 222, {('value1'),('value2')}) }
data2 = 'value1'    'result1'
        'value2'    'result2'

A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );

expected: 'item1', 111, {('thing1', 222, {('value1','result1'), ('value2','result2')})}
                                           ^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^

For the curious: data1 comes from an object oriented datastore, which explains the double nesting (simple object oriented format).

Upvotes: 3

Views: 1466

Answers (2)

Eli
Eli

Reputation: 38899

It sounds like you basically just want to do a join (unclear from the question if this should be INNER, LEFT, RIGHT, or FULL. I think @SNeumann basically has the write answer, but I'll add some code to make it clearer.

Assuming the data looks like:

data1 = 'item1'     111     { ('thing1', 222, {('value1'),('value2')}) }
        ...
data2 = 'value1'    'result1'
        'value2'    'result2'
        ...

I'd do something like (untested):

A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
A_flattened = FOREACH A GENERATE item, d, things.thing AS thing; things.d1 AS d1, FLATTEN(things.values) AS value;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1'
--'item1', 111, 'thing1', 222, 'value2'
A_B_joined = JOIN A_flattened BY value, B BY v;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1', 'value1', 'result1'
--'item1', 111, 'thing1', 222, 'value1', 'value2', 'result2'
A_B_joined1 = FOREACH A_B_JOINED GENERATE item, d, thing, d1, A_flattened::value AS value, r AS result;
A_B_grouped = GROUP A_B_joined1 BY (value, result);

From there, rebagging however you like should be trivial.

EDIT: The above should have used '.' as the projection operator on tuples. I've switched that in. It also assumed things was a big tuple, which it isn't. It's a bag of one item. If the OP never plans to have more than one item in that bag, I'd highly recommend using a tuple instead and loading as:

A = load 'data1' as (item:chararray, d:int, things:(thing:chararray, d1:int, values:bag{(v:chararray)})); 

and then using the rest of the code essentially as is (note: still untested).

If a bag is absolutely required, then the entire problem changes, and it becomes unclear what the OP wants to happen when there are multiple things objects in the bag. Bag projection is also quite a bit more complicated as noted here

Upvotes: 3

SNeumann
SNeumann

Reputation: 1177

I'd try to flatten the bag in A that contains values (1,2), join with B (inner, outer, whatever you're after) and then group again and project the desired structure using TOBAG, etc.

Upvotes: 2

Related Questions