Matan

Reputation: 65

Pig DUMP gets stuck in GROUP

I'm a Pig beginner (using Pig 0.10.0) and I have some simple JSON like the following:

test.json:

{
  "from": "1234567890",
  .....
  "profile": {
      "email": "[email protected]"
      .....
  }
}

on which I perform some grouping/counting in Pig:

>pig -x local

with the following Pig script:

REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;

users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);

domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */

grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
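
For reference, GROUP ... BY produces one (group, bag) tuple per distinct key, so had the job completed, I would expect the dump for this single record to look something like:

(domain.com,{(domain.com,1234567890)})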

Basically, when I try to dump grouped_domain_user, Pig gets stuck, seemingly waiting for a map output to complete:

2012-05-31 17:45:22,111 [Thread-15] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:31,122 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:37,123 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:43,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:46,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:52,126 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:58,127 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:46:01,128 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
.... repeats ....

Any suggestions as to why this is happening would be welcome.

Thanks!

UPDATE

Chris solved this one for me. I was setting fs.default.name, etc. to the correct values in pig.properties; however, I also had the HADOOP_CONF_DIR environment variable pointing to my local Hadoop installation, where these same values were set with <final>true</final>.
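
For anyone hitting the same thing, the pig.properties entries for local mode look roughly like this (Hadoop 1.x property names; values illustrative):

fs.default.name=file:///
mapred.job.tracker=local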

Great find and much appreciated.

Upvotes: 2

Views: 1600

Answers (1)

Chris White

Reputation: 30089

To mark this question as answered, and for those stumbling across this in the future:

When running in local mode (whether for Pig via pig -x local, or when submitting a MapReduce job to the local job runner), if you see the reduce phase 'hang', especially with log entries similar to:

2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - 
      attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress

then your job, although started in local mode, has probably switched to 'clustered' mode because the mapred.job.tracker property is marked as 'final' in your $HADOOP/conf/mapred-site.xml:

<property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9000</value>
    <final>true</final>
</property>

You should also check the fs.default.name property in core-site.xml and ensure it is not marked as final.
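
For example (the host and port here are just illustrative), an entry like the following would pin the default filesystem and silently override whatever local mode tries to set:

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <final>true</final>
</property>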

This means you are unable to override these values at runtime, and you may even see warning messages similar to:

12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name;  Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker;  Ignoring.
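
The fix is either to remove the <final>true</final> markers from the offending config files, or to make sure the local-mode invocation doesn't pick them up at all, e.g. (assuming a bash shell; script.pig stands in for your own script):

# don't let Pig inherit the cluster's final-marked configuration
unset HADOOP_CONF_DIR
pig -x local script.pig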

Upvotes: 3
