kyrre
kyrre

Reputation: 646

Pig Cross product reducer key

When I perform a crossproduct operation (followed by filtering) the reducer sizes are very imbalanced, with some reducers writing zero output and others taking several hours to complete. A basic example is the following code:

crossproduct = cross tweets, clients;

result = filter crossproduct by text matches CONCAT('.*', CONCAT(keyword, '.*'));

store result into 'result' using PigStorage(' ');

In this case what would be the reducer key?

Upvotes: 0

Views: 1086

Answers (1)

Pradeep Gollakota
Pradeep Gollakota

Reputation: 2181

This is a difficult question to answer. Cross is implemented in Pig as a join on synthetic keys. The best resource to understand cross is Programming Pig - Page 68

In your example, the cross would actually look like

A = foreach tweets generate flatten(GFCross(0,2)), flatten(*);
B = foreach clients generate flatten(GFCross(1,2)), flatten(*);
C = cogroup A by ($0, $1), B by ($0, $1);
crossproduct = foreach C generate flatten(A), flatten(B);

As is explained in the book, GFCross is an internal UDF. The first argument is the input number, and the second argument is the total number of inputs. In your example, the UDF generates records that have a schema of (int, int). The field that is the same as the first argument has a random number between 0 and 3. The other field counts from 0 to 3. So, if you assume that the first record in A has a random number 3, and the first record in B has random number 2, the following 4 tuples are generated by the UDF for each input.

A {(3,0), (3,1), (3,2), (3,3)}
B {(0,2), (1,2), (2,2), (3,2)}

When the join is performed, the (3,2) tuple is joined to the (3,2) tuple in B. For every record in each input, it is guaranteed that there is one and only one instance of the artificial keys that will match and produce a record.

So, to answer your question of what exactly is the reduce key... the reduce key is the synthetic key generated by GFCross. Since the random numbers are chosen differently for each record, the resulting joins should be done on an even distribution of the reducers.

Upvotes: 2

Related Questions