Apache PIG - How is data distributed among reducers for a GROUP operation

Question

I was wondering how Pig actually decides how to partition the data in the reduce phase, and whether I can influence the data distribution to avoid unbalanced reducer load.

For example:

grouped_data = GROUP data BY (year, month, day) PARALLEL 10;

Is it possible to change the partitioning, for example by 1) shuffling the data before the group operation, or 2) changing the order of the variables in the tuple?

Or do you suggest a different approach?

Thanks in advance!

Jerome Serrano · Accepted Answer

By default, in most situations, Pig uses Hadoop's default partitioner, HashPartitioner, whose getPartition method is:

public int getPartition(K key, V value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
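
To get a feel for what this formula does with a composite key, here is a minimal sketch in plain Java that runs without Hadoop or Pig. The class and method names are made up for illustration, and `List.hashCode()` is only a stand-in for the hash of the Pig tuple, so the actual reducer numbers in a real job will differ.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: mimics HashPartitioner's arithmetic outside of Hadoop.
class HashPartitionDemo {

    // Same formula as above: clear the sign bit, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many distinct (year, month, day) keys land on each reducer.
    static int[] keysPerReducer(int year, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (int month = 1; month <= 12; month++) {
            for (int day = 1; day <= 28; day++) {
                // Assumption: List.hashCode() stands in for the Pig tuple's hash.
                List<Integer> key = Arrays.asList(year, month, day);
                counts[partitionFor(key, numReduceTasks)]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // One count per reducer, matching PARALLEL 10 in the question.
        System.out.println(Arrays.toString(keysPerReducer(2013, 10)));
    }
}
```

Note that the reducer depends only on the key, so shuffling the input before the GROUP does not change where a given key ends up. Reordering the fields in the tuple changes the hash, but all records with the same key still go to a single reducer, so a key with far more records than the others stays heavy either way.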

You can use the PARTITION BY clause to supply your own partitioner class:

B = GROUP data BY (year, month, day) PARTITION BY foo.bar.CustomPartitioner PARALLEL 10;
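
As a rough sketch of what such a strategy could look like, the plain Java below assigns consecutive calendar days to consecutive reducers. Everything here (the class name, the round-robin rule) is hypothetical and written so that it runs without Hadoop or Pig. In a real job, as far as I know, this logic would go inside `getPartition` of a class extending `org.apache.hadoop.mapreduce.Partitioner`, and you would first have to unwrap the (year, month, day) tuple from the key object Pig passes in.

```java
import java.time.LocalDate;

// Hypothetical partitioning rule, kept free of Hadoop/Pig types so it runs standalone.
class DateRoundRobin {

    // Consecutive calendar days map to consecutive reducers.
    static int partitionFor(int year, int month, int day, int numReduceTasks) {
        long epochDay = LocalDate.of(year, month, day).toEpochDay();
        // floorMod keeps the result in [0, numReduceTasks) even for dates before 1970.
        return (int) Math.floorMod(epochDay, (long) numReduceTasks);
    }

    public static void main(String[] args) {
        for (int day = 1; day <= 5; day++) {
            System.out.println("2013-01-0" + day + " -> reducer "
                    + partitionFor(2013, 1, day, 10));
        }
    }
}
```

A rule like this spreads the keys evenly, but it still sends every record of one day to one reducer, so it only helps when the imbalance comes from how keys are distributed rather than from a few very large days.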
