Pig filter if relation contained in a second relation

Question

A common thing I find myself wanting in Pig: I'll have a small relation with data like

A = id, attribute1:int, attribute2:double...

and a large relation with data like

B = id, differentattribute:chararray...

and I'll want to filter B so that all of its tuples have an id that's contained in A. I know I could do

C = JOIN A by id, B by id;
D = FOREACH C GENERATE B::id, B::differentattribute;

but that seems incredibly inefficient. The question "using IN clause with PIG FILTER" claims there's no IN clause. Is that true? If so, is there a more efficient way to mimic IN with a UDF?

Lorand Bendig · Accepted Answer

If A fits into memory, you may want to have a look at replicated joins:

Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join [...]
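
Applied to the relations in the question, this could look roughly like the sketch below. The load paths and the chararray type for id are assumptions made for illustration; they are not given in the question. The sketch also assumes the distinct ids of A fit in memory.

```pig
-- Hypothetical paths and schemas, for illustration only.
A = LOAD 'small_input' AS (id:chararray, attribute1:int, attribute2:double);
B = LOAD 'large_input' AS (id:chararray, differentattribute:chararray);

-- Keep one row per id so the join cannot duplicate tuples of B.
A_ids  = FOREACH A GENERATE id;
A_keys = DISTINCT A_ids;

-- The large relation goes first; the relations listed after it are loaded into memory.
C = JOIN B BY id, A_keys BY id USING 'replicated';
D = FOREACH C GENERATE B::id AS id, B::differentattribute AS differentattribute;
```

Because the small relation is held in memory, the join happens on the map side and B does not have to be shuffled. If the small relation turns out not to fit in memory, the job is expected to fail rather than fall back to a regular join.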
