ice_planet_hoth_boss

Reputation: 31

Apache Pig Group / Flatten / Join

I have 2 datasets.

Set A contains events, each occurring on a date, with possibly multiple events per date. For example:

10/23/2015, event1
10/23/2015, event2
9/17/2014, event3

Set B has a weather observation for each date, with only one observation per date. For example:

10/23/2015, obs1
10/22/2015, obs2
9/17/2014, obs3

I would like to attach to each event the weather observation for its date, e.g.:

10/23/2015, event1, obs1
10/23/2015, event2, obs1
9/17/2014, event3, obs3

I think this can be accomplished by grouping set A by date, doing an inner join with set B by date, and then flattening the result.

Would somebody please let me know if that is the best way, and show me the code to use? Thanks

Upvotes: 0

Views: 215

Answers (1)

OneCricketeer

Reputation: 191701

No grouping or flattening is needed. An inner join on the date is enough; after that, you only have to project away the duplicated date column.

a = LOAD 'datasetA.txt' USING PigStorage(',') AS (date:chararray, evt:chararray);
b = LOAD 'datasetB.txt' USING PigStorage(',') AS (date:chararray, obs:chararray);
-- Inner join on the date field (Pig uses JOIN ... BY, not SQL-style ON)
c_join = JOIN a BY date, b BY date;
-- Keep a single copy of the date column
c = FOREACH c_join GENERATE a::date, a::evt, b::obs;

Output

dump c;
(9/17/2014, event3, obs3)
(10/23/2015, event2, obs1)
(10/23/2015, event1, obs1)
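
To see the duplicated date column for yourself, you can run DESCRIBE on the joined relation before projecting. This is only a minimal sketch: it assumes the same file names and schemas as in the answer, and the relation name `joined` is just illustrative.

```pig
-- Assumed inputs: same files and schemas as in the answer
a = LOAD 'datasetA.txt' USING PigStorage(',') AS (date:chararray, evt:chararray);
b = LOAD 'datasetB.txt' USING PigStorage(',') AS (date:chararray, obs:chararray);

-- Inner join on date: each event row pairs with the one observation for its date
joined = JOIN a BY date, b BY date;

-- Print the schema; it should list four fields, including both a::date and b::date
DESCRIBE joined;
```

The join keeps every field from both inputs, so the FOREACH ... GENERATE step is what drops the second date.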

Upvotes: 0
