Reputation: 331
I am trying to perform a simple join in Apache Pig. The datasets I use are from http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html
This is what I do in the Pig shell:
profiles = LOAD '/user/hadoop/tests/userid-profile.tsv' AS (id,gender,age,country, dreg);
songs = LOAD '/user/hadoop/tests/userid-timestamp-artid-artname-traid-traname.tsv' AS (userID, timestamp, artistID, artistName, trackID, trackName);
prDACH = filter profiles by country=='Germany' or country=='Austria' or country=='Switzerland';
songsDACH = join songs by userID, prDACH by id;
dump songsDACH;
This is a part of the log:
2013-04-20 01:01:33,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-04-20 01:02:39,802 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 2% complete
2013-04-20 01:13:23,943 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 37% complete
2013-04-20 01:14:48,704 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 39% complete
2013-04-20 01:15:40,166 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 41% complete
2013-04-20 01:15:41,142 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-04-20 01:15:41,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1366403809583_0023 has failed! Stop running all dependent jobs
2013-04-20 01:15:41,143 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-04-20 01:15:43,117 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1366403809583_0023_m_000019_0 Info:Container killed by the ApplicationMaster.
When I use a small sample of the songs, the join runs without any problem. Any ideas?
It looks like a problem with the HDFS settings, since I can perform the join on a subset of the songs data (100,000 samples).
PS: I am using the Cloudera demo VM.
Upvotes: 1
Views: 1313
Reputation: 1177
You should have a look at the task attempt's log: point your browser at the job tracker (http://[your-jobtracker-node]:50030), look for the failed job, find a failed task attempt, and browse through its log to see the actual exception. I suspect that it may have something to do with the task heap size configuration, but you'll have to look at the exception first and then come up with a solution (a configuration change, etc.).
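If the task log does point to memory, one possible configuration change is to raise the task memory from inside the Pig script before rerunning the join. The following is only a sketch under assumptions: the log mentions containers and an ApplicationMaster, which suggests the job runs on YARN, so the MRv2 property names are used; the values are illustrative, not tuned, and the demo VM must actually have that much memory available.

```pig
-- Assumption: the job runs on YARN (MRv2), so these property names apply.
-- The numbers are illustrative placeholders, not recommended values.
SET mapreduce.map.memory.mb 2048;           -- container size for map tasks, in MB
SET mapreduce.map.java.opts '-Xmx1638m';    -- JVM heap, kept below the container size
SET mapreduce.reduce.memory.mb 2048;        -- container size for reduce tasks, in MB
SET mapreduce.reduce.java.opts '-Xmx1638m';

-- The rest of the script is unchanged from the question.
profiles = LOAD '/user/hadoop/tests/userid-profile.tsv' AS (id,gender,age,country, dreg);
songs = LOAD '/user/hadoop/tests/userid-timestamp-artid-artname-traid-traname.tsv' AS (userID, timestamp, artistID, artistName, trackID, trackName);
prDACH = filter profiles by country=='Germany' or country=='Austria' or country=='Switzerland';
songsDACH = join songs by userID, prDACH by id;
dump songsDACH;
```

Keeping the JVM heap somewhat below the container size leaves room for non-heap memory. If the exception in the task log turns out to be something other than a memory problem, these settings will not help.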
Upvotes: 1