François M.

Reputation: 4278

"Container is running beyond physical memory limits"

I'm working on a job in which Hive queries use R scripts, distributed across the cluster so they run on each node.

Like this:

ADD FILES hdfs://path/reducers/my_script.R
SET hive.mapred.reduce.tasks.speculative.execution=false;
SET mapred.reduce.tasks = 80;

INSERT OVERWRITE TABLE final_output_table
PARTITION (partition_column1, partition_column2)
SELECT  selected_column1, selected_column2, partition_column1, partition_column2
FROM (
    FROM
      (SELECT input_column1, input_column2, input_column3
       FROM input_table
       WHERE partition_column1 = ${parameter1}
         AND partition_column2 = ${parameter2}
       DISTRIBUTE BY concat(input_column1, partition_column1)) mapped
    REDUCE input_column1, input_column2, input_column3
    USING 'my_script.R'
    AS selected_column1, selected_column2
) reduced

(Hopefully there's no mistake in this reduced example; I'm quite confident there is none in my real code.)

Some of the many reduce tasks succeed (17 on my last try, 58 on the previous one), some are killed (64 on the last try, 23 on the previous one), and some fail (31 on the last try, 25 on the previous one).

You'll find the full log of one of the failed reduce attempts at the bottom of the question in case it's needed, but if I'm not mistaken, here are the important parts:

Container [pid=14521, containerID=container_1508303276896_0052_01_000045] is running beyond physical memory limits. 
Current usage: 3.1 GB of 3 GB physical memory used; 6.5 GB of 12 GB virtual memory used. 
Killing container. 
[...]
Container killed on request. 
Exit code is 143 Container exited with a non-zero exit code 143

What I understand: the computation done in my_script.R takes too much physical memory.

Let's assume that the code in my_script.R can't be improved any further, and that the DISTRIBUTE BY clause can't be changed.

My question then is: what can I do to avoid using too much memory?

Or, maybe (since some reducers succeed): should I change the number of reducers so that each one handles less data?

In case it's useful:

Average Map Time        1mins, 3sec
Average Shuffle Time    10sec
Average Merge Time      1sec
Average Reduce Time     7mins, 5sec

Full log of one of the failed reduce attempts (from the Hadoop job monitoring consoles, ports 8088 and 19888):

Container [pid=14521,containerID=container_1508303276896_0052_01_000045] is running beyond physical memory limits. 
Current usage: 3.1 GB of 3 GB physical memory used; 6.5 GB of 12 GB virtual memory used. 
Killing container. 
Dump of the process-tree for container_1508303276896_0052_01_000045 : 
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE 
|- 15690 14650 14521 14521 (R) 5978 434 2956750848 559354 /usr/lib/R/bin/exec/R --slave --no-restore --file=/mnt/bi/hadoop_tmp/nm-local-dir/usercache/hadoop/appcache/application_1508303276896_0052/container_1508303276896_0052_01_000045/./my_script.R 
|- 14650 14521 14521 14521 (java) 3837 127 3963912192 262109 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2048m -Djava.io.tmpdir=/mnt/bi/hadoop_tmp/nm-local-dir/usercache/hadoop/appcache/application_1508303276896_0052/container_1508303276896_0052_01_000045/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/mnt/bi/hadoop_tmp/userlogs/application_1508303276896_0052/container_1508303276896_0052_01_000045 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 10.32.128.5 20021 attempt_1508303276896_0052_r_000014_0 45 
|- 14521 20253 14521 14521 (bash) 1 2 13578240 677 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2048m -Djava.io.tmpdir=/mnt/bi/hadoop_tmp/nm-local-dir/usercache/hadoop/appcache/application_1508303276896_0052/container_1508303276896_0052_01_000045/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/mnt/bi/hadoop_tmp/userlogs/application_1508303276896_0052/container_1508303276896_0052_01_000045 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 10.32.128.5 20021 attempt_1508303276896_0052_r_000014_0 45 
1>/mnt/bi/hadoop_tmp/userlogs/application_1508303276896_0052/container_1508303276896_0052_01_000045/stdout 
2>/mnt/bi/hadoop_tmp/userlogs/application_1508303276896_0052/container_1508303276896_0052_01_000045/stderr 
Container killed on request. 
Exit code is 143 Container exited with a non-zero exit code 143

Upvotes: 1

Views: 3729

Answers (2)

Samson Scharfrichter

Reputation: 9067

If your Reduce steps are borderline with just 3GB, just give them 4GB...!
SET mapreduce.reduce.memory.mb = 4096;

Unless you are using TEZ, which has its own generic property for container size, hive.tez.container.size.
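
For example, a minimal sketch of the session-level settings (assuming the classic MapReduce engine; the 4096 and 3276 values are only illustrative and should be sized for your cluster):

-- reducer container size requested from YARN
SET mapreduce.reduce.memory.mb = 4096;
-- reducer JVM heap, kept below the container limit (rule of thumb: roughly 80% of the container)
SET mapreduce.reduce.java.opts = -Xmx3276m;
-- or, if the query runs on the Tez engine:
SET hive.tez.container.size = 4096;

Keep in mind that with a streaming reducer the R process and the JVM live in the same container (as the process-tree dump in the question shows), so their combined memory has to stay under mapreduce.reduce.memory.mb.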


For extra information about how YARN manages the memory quotas, see Distcp - Container is running beyond physical memory limits

Upvotes: 2

François M.

Reputation: 4278

OK, I'd love a fuller explanation, but in the meantime, here's a trial-and-error answer:

  • Tried 40 reducers: failed.
  • Tried 160 reducers: succeeded once. I'm running the job a few more times to see whether that's reliable, and I'll update this answer if it turns out to have been a one-time-only success (settings sketched below).
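
A hedged sketch of that trial in the script header, using the same properties as in the question (160 is simply the value that happened to work here, not a recommendation):

SET hive.mapred.reduce.tasks.speculative.execution = false;
-- more reducers means fewer DISTRIBUTE BY keys per reducer, so each my_script.R instance holds less data in memory
SET mapred.reduce.tasks = 160;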

Upvotes: 0
