Reputation: 31
I have 2 datasets.
Set A has a number of events, each occurring on a date, with multiple events per date. Eg:
10/23/2015, event1
10/23/2015, event2
9/17/2014, event3
Set B has a weather observation for each date. There is only one observation per date. Eg:
10/23/2015, obs1
10/22/2015, obs2
9/17/2014, obs3
I would like to attach to each event the weather observation for its respective date, Eg:
10/23/2015, event1, obs1
10/23/2015, event2, obs1
9/17/2014, event3, obs3
I think this can be accomplished by grouping set A by date, doing an inner join with set B by date, and then flattening the result.
Would somebody please let me know if that is the best way, and show me the code to use? Thanks
Upvotes: 0
Views: 215
Reputation: 191701
No grouping or flattening is needed. A plain join on the date is enough; afterwards you only have to drop the duplicated date column.
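-- Load both datasets; each line is "date,value"
a = LOAD 'datasetA.txt' USING PigStorage(',') AS (date:chararray, evt:chararray);
b = LOAD 'datasetB.txt' USING PigStorage(',') AS (date:chararray, obs:chararray);
-- Inner join on date: each event row is paired with that date's single observation
c_join = JOIN a BY date, b BY date;
-- Keep only one copy of the date column
c = FOREACH c_join GENERATE a::date, a::evt, b::obs;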
Output
dump c;
(9/17/2014, event3, obs3)
(10/23/2015, event2, obs1)
(10/23/2015, event1, obs1)
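One thing to watch for: an inner join silently drops any event whose date has no row in set B. If that can happen in your data and you want to keep those events, a left outer join is one option. This is only a minimal sketch, assuming the same file names and schemas as above; the alias names c_left and c_all are made up for illustration.

```pig
-- Same inputs as above (file names and schemas assumed unchanged)
a = LOAD 'datasetA.txt' USING PigStorage(',') AS (date:chararray, evt:chararray);
b = LOAD 'datasetB.txt' USING PigStorage(',') AS (date:chararray, obs:chararray);

-- LEFT OUTER keeps every row of a; where b has no matching date, b's fields are null
c_left = JOIN a BY date LEFT OUTER, b BY date;

-- Project one date column; obs is null for events without an observation
c_all = FOREACH c_left GENERATE a::date, a::evt, b::obs;

DUMP c_all;
```

With the sample data in the question every event date has an observation, so this should give the same rows as the inner join.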
Upvotes: 0