Reputation: 186
I'm running Mahout 0.6 from the command line on an Amazon Elastic MapReduce cluster, trying to canopy-cluster ~1500 short documents, and the jobs keep failing with an "Error: Java heap space" message.
Based on previous questions here and elsewhere, I've cranked up every memory knob I can find:
conf/hadoop-env.sh: raised every heap size setting there to 1.5GB on small instances and to 4GB on large instances.
conf/mapred-site.xml: added the mapred.map.child.java.opts and mapred.reduce.child.java.opts properties, both set to -Xmx4000m.
$MAHOUT_HOME/bin/mahout: increased JAVA_HEAP_MAX and set MAHOUT_HEAPSIZE to 6GB (on large instances) as well.
The problem persists. I've been banging my head against this for far too long. Does anyone have any suggestions?
The full command and output look something like this (run on a cluster of Large instances, in hopes that it would alleviate the problem):
hadoop@ip-10-80-202-112:~$ mahout-distribution-0.6/bin/mahout canopy -i sparse-data/2010/tf-vectors -o canopy-out/2010 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -ow -t1 0.5 -t2 0.005 -cl
run with heapsize 6000
-Xmx6000m
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/hadoop
No HADOOP_CONF_DIR set, using /home/hadoop/conf
MAHOUT-JOB: /home/hadoop/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/04/29 19:50:23 INFO common.AbstractJob: Command line arguments: {--clustering=null, --distanceMeasure=org.apache.mahout.common.distance.TanimotoDistanceMeasure, --endPhase=2147483647, --input=sparse-data/2010/tf-vectors, --method=mapreduce, --output=canopy-out/2010, --overwrite=null, --startPhase=0, --t1=0.5, --t2=0.005, --tempDir=temp}
12/04/29 19:50:24 INFO common.HadoopUtil: Deleting canopy-out/2010
12/04/29 19:50:24 INFO canopy.CanopyDriver: Build Clusters Input: sparse-data/2010/tf-vectors Out: canopy-out/2010 Measure: org.apache.mahout.common.distance.TanimotoDistanceMeasure@a383118 t1: 0.5 t2: 0.0050
12/04/29 19:50:24 INFO mapred.JobClient: Default number of map tasks: null
12/04/29 19:50:24 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 24
12/04/29 19:50:24 INFO mapred.JobClient: Default number of reduce tasks: 1
12/04/29 19:50:25 INFO mapred.JobClient: Setting group to hadoop
12/04/29 19:50:25 INFO input.FileInputFormat: Total input paths to process : 1
12/04/29 19:50:25 INFO mapred.JobClient: Running job: job_201204291846_0004
12/04/29 19:50:26 INFO mapred.JobClient: map 0% reduce 0%
12/04/29 19:50:45 INFO mapred.JobClient: map 27% reduce 0%
[ ... Continues fine until... ]
12/04/29 20:05:54 INFO mapred.JobClient: map 100% reduce 99%
12/04/29 20:06:12 INFO mapred.JobClient: map 100% reduce 0%
12/04/29 20:06:20 INFO mapred.JobClient: Task Id : attempt_201204291846_0004_r_000000_0, Status : FAILED
Error: Java heap space
12/04/29 20:06:41 INFO mapred.JobClient: map 100% reduce 33%
12/04/29 20:06:44 INFO mapred.JobClient: map 100% reduce 68%
[ ... REPEAT SEVERAL ITERATIONS, UNTIL ... ]
12/04/29 20:37:58 INFO mapred.JobClient: map 100% reduce 0%
12/04/29 20:38:09 INFO mapred.JobClient: Job complete: job_201204291846_0004
12/04/29 20:38:09 INFO mapred.JobClient: Counters: 23
12/04/29 20:38:09 INFO mapred.JobClient: Job Counters
12/04/29 20:38:09 INFO mapred.JobClient: Launched reduce tasks=4
12/04/29 20:38:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=94447
12/04/29 20:38:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/04/29 20:38:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/04/29 20:38:09 INFO mapred.JobClient: Rack-local map tasks=1
12/04/29 20:38:09 INFO mapred.JobClient: Launched map tasks=1
12/04/29 20:38:09 INFO mapred.JobClient: Failed reduce tasks=1
12/04/29 20:38:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=23031
12/04/29 20:38:09 INFO mapred.JobClient: FileSystemCounters
12/04/29 20:38:09 INFO mapred.JobClient: HDFS_BYTES_READ=24100612
12/04/29 20:38:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=49399745
12/04/29 20:38:09 INFO mapred.JobClient: File Input Format Counters
12/04/29 20:38:09 INFO mapred.JobClient: Bytes Read=24100469
12/04/29 20:38:09 INFO mapred.JobClient: Map-Reduce Framework
12/04/29 20:38:09 INFO mapred.JobClient: Map output materialized bytes=49374728
12/04/29 20:38:09 INFO mapred.JobClient: Combine output records=0
12/04/29 20:38:09 INFO mapred.JobClient: Map input records=409
12/04/29 20:38:09 INFO mapred.JobClient: Physical memory (bytes) snapshot=2785939456
12/04/29 20:38:09 INFO mapred.JobClient: Spilled Records=409
12/04/29 20:38:09 INFO mapred.JobClient: Map output bytes=118596530
12/04/29 20:38:09 INFO mapred.JobClient: CPU time spent (ms)=83190
12/04/29 20:38:09 INFO mapred.JobClient: Total committed heap usage (bytes)=2548629504
12/04/29 20:38:09 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4584386560
12/04/29 20:38:09 INFO mapred.JobClient: Combine input records=0
12/04/29 20:38:09 INFO mapred.JobClient: Map output records=409
12/04/29 20:38:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=143
Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing sparse-data/2010/tf-vectors
at org.apache.mahout.clustering.canopy.CanopyDriver.buildClustersMR(CanopyDriver.java:349)
at org.apache.mahout.clustering.canopy.CanopyDriver.buildClusters(CanopyDriver.java:236)
at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:145)
at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:109)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Upvotes: 3
Views: 3971
Reputation: 361
In a normal situation, you would increase the memory allocation for map/reduce child tasks by setting "mapred.map.child.java.opts" and/or "mapred.reduce.child.java.opts" with something like "-Xmx3g".
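As a rough sketch of that "normal situation" (not something I have verified on EMR): the stack trace in the question shows CanopyDriver going through Hadoop's ToolRunner, so it should accept Hadoop generic options such as -Dname=value ahead of the job's own arguments. The -Xmx3g value is only an example, and the DRY_RUN switch is a hypothetical convenience so the command is printed rather than executed by default.

```shell
#!/bin/sh
# Example heap setting for reduce tasks; the value is an assumption, tune it
# to the memory actually available on your instances.
REDUCE_OPTS="-Xmx3g"

# Generic Hadoop options (-Dname=value) are placed before the job's own
# arguments. The remaining arguments are copied from the question.
set -- canopy \
  "-Dmapred.reduce.child.java.opts=${REDUCE_OPTS}" \
  -i sparse-data/2010/tf-vectors \
  -o canopy-out/2010 \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
  -ow -t1 0.5 -t2 0.005 -cl

# Print the command by default; set DRY_RUN=0 to actually run it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "mahout-distribution-0.6/bin/mahout $*"
else
  mahout-distribution-0.6/bin/mahout "$@"
fi
```

Only the reduce-side property is set here because the failing attempt in the question's log is a reduce task (attempt_..._r_000000_0).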
However, when you're running things on AWS you have less direct control over these settings. Amazon provides a mechanism for configuring your EMR cluster upon startup called "bootstrap actions".
For memory-intensive workflows, i.e. anything Mahout :), check out the "MemoryIntensive" bootstrap action.
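Here is a hedged sketch of launching a cluster with that bootstrap action using the elastic-mapreduce command-line client. The S3 location of the memory-intensive action is written from memory of the EMR documentation, and the cluster name and instance count are made-up placeholders, so check all of them against the current docs before relying on this. As a precaution the command is only printed unless DRY_RUN=0 is set (DRY_RUN is a hypothetical switch, not part of the client).

```shell
#!/bin/sh
# Assumed location of the predefined memory-intensive bootstrap action;
# verify it in the EMR documentation.
BOOTSTRAP_ACTION="s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive"

# Placeholder cluster settings: the name and instance count are hypothetical,
# m1.large mirrors the "Large instances" mentioned in the question.
set -- --create --alive \
  --name "mahout-canopy" \
  --instance-type m1.large \
  --num-instances 4 \
  --bootstrap-action "$BOOTSTRAP_ACTION"

# Print the command by default; set DRY_RUN=0 to actually launch the cluster.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "elastic-mapreduce $*"
else
  elastic-mapreduce "$@"
fi
```

Bootstrap actions run on each node before Hadoop starts, which is why the memory settings are applied at cluster creation rather than by editing files on the master afterwards.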
Upvotes: 3
Reputation: 66886
Your local Hadoop configuration has nothing to do with how EMR runs, and neither do these environment variables. You have to configure EMR itself, and some of these settings have no equivalent there. For example, your worker memory is determined by the kind of instance you ask for.
The error doesn't indicate anything to do with memory. EMR interrupted the job while waiting for it to finish for some reason. Did it fail?
Upvotes: 1