Pig parallel avg

Question

Is it possible to tell Pig to output 10 r files, the way MapReduce does when it uses 10 reducers? My Pig script outputs just one r file, which I guess means it is using just one reducer. I have put

SET default_parallel 10;

in my script and in stderr I can see that at the beginning

[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 10

but in the middle of MapReduceLauncher it goes back to

[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1

My script does a join, sums two columns, and then computes the average of one column. I suspect the drop to one reducer happens because of avg or group all. Is that correct?

Frederic · Accepted Answer

Yes. Quoting from http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#group_by

[...] keep in mind that when using group all, you are necessarily serializing your pipeline. That is, this step and any step after it until you split out the single bag now containing all of your records will not be done in parallel.
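To make the quoted point concrete, here is a minimal sketch of a script shaped like the one described in the question. The input paths, relation names, and schemas are all hypothetical, since the original script was not posted. The idea is that the join job can honor default_parallel, while the job that evaluates group all is forced down to a single reducer.

```pig
SET default_parallel 10;

-- Hypothetical inputs and schemas, for illustration only.
a = LOAD 'input_a' AS (id:chararray, x:double);
b = LOAD 'input_b' AS (id:chararray, y:double);

-- The join has a reduce phase, and it can use the 10 reducers requested above.
joined = JOIN a BY id, b BY id;
summed = FOREACH joined GENERATE a::id AS id, a::x + b::y AS total;

-- GROUP ALL places every record in one group, so Pig runs this job
-- with a single reducer regardless of default_parallel.
grouped = GROUP summed ALL;
average = FOREACH grouped GENERATE AVG(summed.total) AS avg_total;

-- Hypothetical output path; the result is a single row.
STORE average INTO 'out_average';
```

Under these assumptions the final result is one row, so a single output file is expected rather than a sign of a misconfiguration. AVG is an algebraic function in Pig, so partial sums and counts can be computed on the map side and the lone reducer usually has little work to do.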
