Reputation: 294
I was wondering how Pig actually decides how to partition the data in the reduce phase, and whether I can influence the data distribution to avoid an unbalanced reducer load.
For example:
grouped_data = GROUP data BY (year, month, day) PARALLEL 10;
Is it possible to change the partitioning, for example by: 1) shuffling the data before the GROUP operation, or 2) changing the order of the variables in the tuple?
Or do you suggest a different approach?
Thanks in advance!
Upvotes: 1
Views: 180
Reputation: 1855
By default, in most situations, Pig uses Hadoop's default partitioner, the HashPartitioner:
public int getPartition(K key, V value, int numReduceTasks) {
    // Clear the sign bit, then take the key's hash modulo the number of reducers.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
Every record with the same group key is hashed to the same reducer, so a handful of very large (year, month, day) groups can leave the reducer load unbalanced. You can use the PARTITION BY clause to supply your own partitioning strategy:
B = GROUP data BY (year, month, day) PARTITION BY foo.bar.CustomPartitioner PARALLEL 10;
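As a minimal sketch, such a custom partitioner could look like the following (the package/class name foo.bar.CustomPartitioner is just the placeholder from the statement above, and the bit-mixing hash is one possible strategy, not the only one). Pig hands the partitioner the group key wrapped in a PigNullableWritable:

package foo.bar;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

public class CustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        // Mix the hash bits a little more than the default before taking the modulo.
        // Note: all records sharing one group key still end up on a single reducer;
        // a partitioner only changes which reducer that is.
        int h = key.hashCode();
        h ^= h >>> 16;
        return (h & Integer.MAX_VALUE) % numPartitions;
    }
}

The class has to be available on Pig's classpath (for example by registering the jar that contains it), and numPartitions corresponds to the number of reducers you request with PARALLEL.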
Upvotes: 1