Reputation: 195
A common thing I find myself wanting in Pig: I'll have a small relation with data like
A = id, attribute1:int, attribute2:double...
and a large relation with data like
B = id, differentattribute:chararray...
and I'll want to filter B so that all of its tuples have an id that's contained in A. I know I could do
C = JOIN A by id, B by id;
D = FOREACH C GENERATE B::id, B::differentattribute;
but that seems incredibly inefficient. The question "using IN clause with PIG FILTER" claims there's no IN clause. If there really isn't one, is there a more efficient way to mimic IN with a UDF?
Upvotes: 0
Views: 392
Reputation: 10650
If A fits into memory, you may want to look at replicated joins:
Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join [...]
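A minimal sketch of how that might look for the relations in the question. The load paths, the `long` type for `id`, and the aliases `A_ids_all` and `A_ids` are assumptions for illustration, not something taken from the question. Projecting A down to its distinct ids first keeps the in-memory side small and avoids duplicating rows of B when an id appears more than once in A.

```pig
-- Hypothetical paths and schemas; the id type is assumed to be long.
A = LOAD 'small_input' AS (id:long, attribute1:int, attribute2:double);
B = LOAD 'large_input' AS (id:long, differentattribute:chararray);

-- Keep only the distinct ids of the small relation.
A_ids_all = FOREACH A GENERATE id;
A_ids = DISTINCT A_ids_all;

-- The large relation goes first; the relation listed last is loaded into memory.
C = JOIN B BY id, A_ids BY id USING 'replicated';

-- Keep only B's columns, which gives the effect of "B.id IN A".
D = FOREACH C GENERATE B::id AS id, B::differentattribute AS differentattribute;
```

Note the order in the JOIN: with `USING 'replicated'`, the small relation has to come last. If it does not fit into memory, the job is expected to fail rather than fall back to a regular join.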
Upvotes: 1