Reputation: 838
Is it possible to tell Pig to write 10 part-r files, the way MapReduce does when it uses 10 reducers? My Pig script writes just one part-r file, which I guess means it is using just one reducer. I have put
SET default_parallel 10;
in my script, and at the beginning of stderr I can see
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 10
but in the middle of MapReduceLauncher it goes back to
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
I do a join, sum two columns, and then compute the average of one column. I suspect this happens because of the avg or the group all. Is that correct?
Upvotes: 0
Views: 247
Reputation: 3284
Yes. Quoting from http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#group_by
[...] keep in mind that when using group all, you are necessarily serializing your pipeline. That is, this step and any step after it until you split out the single bag now containing all of your records will not be done in parallel.
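To make that concrete, here is a rough sketch of a script shaped like the one you describe (join, sum two columns, then average). The paths, relation names, and field names are all made up for illustration; only the default_parallel setting comes from your question. The join works on many distinct keys, so it can be spread over the 10 reducers, while group all collects every record under one key, which would explain the "Setting Parallelism to 1" line for that job.

```pig
-- Hypothetical inputs and schema; only default_parallel is taken from the question.
SET default_parallel 10;

a = LOAD 'input_a' AS (id:chararray, x:double);
b = LOAD 'input_b' AS (id:chararray, y:double);

-- Many distinct join keys, so this step can be spread across the reducers.
joined = JOIN a BY id, b BY id;
summed = FOREACH joined GENERATE a::id AS id, (a::x + b::y) AS total;

-- GROUP ALL places every record in a single bag under one key,
-- so this step is handled by a single reducer.
grouped   = GROUP summed ALL;
avg_total = FOREACH grouped GENERATE AVG(summed.total) AS avg_total;

-- The average is a single record, so one output file is expected here.
STORE avg_total INTO 'out_avg';
```

Since the final result of an average over everything is one row, a single output file for that step is what you would expect anyway. If you need the joined and summed data in several files, storing that relation separately, before the group all, is one option to try.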
Upvotes: 1