Reputation: 2845
I am trying to do a star schema type of join in pig and below is my code. When I join multiple relations with different columns, I have to prefix the name of the previous join every time to get it working. I am sure there should be some better way, I am not able to find it through googling. Any pointers will be very helpful.
i.e prefixing a column like this "H864::H86::hs_8_d::hs_8_desc" is what I want to avoid.
hs_8 = LOAD 'hs_8_distinct' USING PigStorage('^') as (hs_8:chararray,hs_8_desc:chararray);
hs_8_d = FOREACH hs_8 GENERATE SUBSTRING(hs_8,0,2) as hs_2,SUBSTRING(hs_8,0,4) as hs_4,SUBSTRING(hs_8,0,6) as hs_6,hs_8,hs_8_desc;
hs_6_d = LOAD 'hs_6_distinct' USING PigStorage('^') as (hs_6:chararray,hs_6_desc:chararray);
hs_4_d = LOAD 'hs_4_distinct' USING PigStorage('^') as (hs_4:chararray,hs_4_desc:chararray);
hs_2_d = LOAD 'hs_2_distinct' USING PigStorage('^') as (hs_2:chararray,hs_2_desc:chararray);
H86 = JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated' ;
H864 = JOIN H86 BY hs_8_d::hs_4, hs_4_d BY hs_4 USING 'replicated' ;
H8642 = JOIN H864 BY H86::hs_8_d::hs_2, hs_2_d BY hs_2 USING 'replicated' ;
hs_dim = FOREACH H8642 GENERATE hs_2_d::hs_2,hs_2_d::hs_2_desc,H864::hs_4_d::hs_4,H864::hs_4_d::hs_4_desc,H864::H86::hs_6_d::hs_6,H864::H86::hs_6_d::hs_6_desc,H864::H86::hs_8_d::hs_8,H864::H86::hs_8_d::hs_8_desc;
Upvotes: 1
Views: 658
Reputation: 10650
By adding extra foreach to the joins you can slightly simplify the aliases. Check the statistics, this won't add extra MR jobs to the pipeline. The original and this will yield to 4 map-only jobs.
E.g:
H86 = foreach (JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated') generate
hs_8_d::hs_2 as x1,
hs_8_d::hs_4 as x2,
hs_8_d::hs_6 as x3,
hs_8_d::hs_8 as x4,
hs_8_d::hs_8_desc as x5,
hs_6_d::hs_6 as x6,
hs_6_d::hs_6_desc as x7;
H864 = foreach (JOIN H86 BY x2, hs_4_d BY hs_4 USING 'replicated') generate
H86::x1 as y1,
H86::x2 as y2,
H86::x3 as y3,
H86::x4 as y4,
H86::x5 as y5,
H86::x6 as y6,
H86::x7 as y7,
hs_4_d::hs_4 as y8,
hs_4_d::hs_4_desc as y9;
H8642 = foreach (JOIN H864 BY y1, hs_2_d BY hs_2 USING 'replicated') generate
H864::y1 as z1,
H864::y2 as z2,
H864::y3 as z3,
H864::y4 as z4,
H864::y5 as z5,
H864::y6 as z6,
H864::y7 as z7,
H864::y8 as z8,
H864::y9 as z9,
hs_2_d::hs_2 as z10,
hs_2_d::hs_2_desc as z11;
hs_dim = FOREACH H8642 GENERATE z10, z11, z8, z9, z6, z7, z4, z5;
If you have a bag of tuples, then Datafu's AliasBagFields may be helpful.
Upvotes: 2
Reputation: 3805
Pig will always prefixes fields with bagname::
to disambiguate fields after joins. I don't think you can avoid this unfortunately.
Upvotes: 0