kee

Reputation: 11619

How to avoid the same joining for two fields?

I admit the title of this question is not clear. If someone could reword it after reading my question, that would be great.

Anyway, I have a pair of fields that hold word IDs, and I want to replace each ID with the word's text. Right now I am doing two joins, each followed by a FOREACH, like this:

WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);

Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID;
Replaced1 = FOREACH Join1 GENERATE WordTexts::wordText As wordText1, WordIDs::wordID2;

Join2 = JOIN Replaced1 BY wordID2, WordTexts BY wordID;
Replaced2 = FOREACH Join2 GENERATE Replaced1::wordText1 AS wordText1, WordTexts::wordText AS wordText2;

Is there any way of doing this with fewer statements (for example, one join instead of two)?

Upvotes: 0

Views: 47

Answers (1)

alexeipab

Reputation: 3619

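I think your current code will generate two separate MapReduce jobs. To avoid that, use a replicated join. It will not change the number of JOIN statements, but both joins run map-side, so only one MapReduce job should be needed. Note that a replicated join requires the replicated relation (WordTexts here) to be small enough to fit in memory. The code should look like this (I have not run it yet):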

WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);

Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID USING 'replicated';
Join2 = JOIN Join1 BY wordID2, WordTexts BY wordID USING 'replicated';

Replaced = FOREACH Join2 GENERATE Join1::WordTexts::wordText AS wordText1, WordTexts::wordText AS wordText2;
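
To show why a map-side join needs no reduce phase, here is a rough sketch in plain Python (not Pig) of the idea behind a replicated join: the small relation is held in memory and both IDs are looked up in a single pass over the large one. The function name replace_ids and the sample rows are made up for illustration, and the sketch assumes each wordID appears only once in the word-text data.

```python
# Conceptual sketch only: mimics a map-side (replicated) join in plain Python.
# The function name and the sample rows below are hypothetical.

def replace_ids(word_id_pairs, word_texts):
    """Replace both IDs in each pair with their text in a single pass."""
    # The small relation is held in memory, as a replicated join does.
    # Assumption: each wordID is unique, so a plain dict is enough.
    lookup = dict(word_texts)
    replaced = []
    for id1, id2 in word_id_pairs:
        # Inner-join semantics: drop pairs where either ID has no text.
        if id1 in lookup and id2 in lookup:
            replaced.append((lookup[id1], lookup[id2]))
    return replaced


word_texts = [(1, "apple"), (2, "banana"), (3, "cherry")]
word_id_pairs = [(1, 2), (3, 1), (2, 9)]

result = replace_ids(word_id_pairs, word_texts)
print(result)
```

The pair (2, 9) is dropped because ID 9 has no text, which matches what the inner joins in the Pig script would do.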

Upvotes: 1
