Spark on EMR: "Container from a bad node" error

I am trying to run our pipelines as EMR steps, but I am stuck on this one.

EMR cluster's configuration:

 "logPath" : "s3://ekin-logs/",
 "masterInstanceType" : "m5.xlarge",
 "slaveInstanceType" : "m5.xlarge",
 "instanceCount" : 2,
 "subnetIds" : $SUBNET_ID,
 "ec2KeyName" : "ekin-analytics",
 "applications" : ["Spark","Hadoop"], 

Step parameters:

"args" : [
        "spark-submit",
        "--master", "yarn",
        "--executor-memory", "8G",
        "--driver-memory", "7G",
        "--deploy-mode","cluster",
        "--class","com.testinium.analytics.AppCommonDataSource",
        "--conf","spark.eventLog.enabled=true",
        "s3://analytics-emr-test/ekin-spark-app.jar",
        "--prefixOutputDir", "hdfs:///home/hadoop/data/customer",
        "--maxTimeGapThreshold","180000",
        "--domainId", "13",
        "--submitId", "1",
        "--startTime" ,"1543664538237",
        "--endTime", "1551994119153"
        ],
        "jar" : "command-runner.jar",
        "name" : "AppCommonDataSource",
        "actionOnFailure" : "CANCEL_AND_WAIT"

At first, I was getting the error below:

ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 8.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
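The numbers in this message come from YARN comparing the container's physical memory use against its allocation, which is sized as spark.executor.memory plus spark.yarn.executor.memoryOverhead (the overhead defaults to a few hundred MB up to roughly 10% of executor memory, depending on Spark version). One way to follow the message's own suggestion is to raise the overhead directly in the step arguments; the 2048 MB below is an illustrative value, not a tuned recommendation:

```json
"args" : [
        "spark-submit",
        "--master", "yarn",
        "--executor-memory", "8G",
        "--driver-memory", "7G",
        "--deploy-mode", "cluster",
        "--conf", "spark.yarn.executor.memoryOverhead=2048",
        ...
```

Note also that in cluster deploy mode the 7G driver runs in its own YARN container on a core node, so with a single m5.xlarge core node the driver and executor containers (each plus overhead) must both fit within the 10240 MB given to yarn.nodemanager.resource.memory-mb; reducing --executor-memory or --driver-memory may be needed as well.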

I researched on Stack Overflow and found that people solved this problem by setting the yarn.nodemanager.vmem-check-enabled parameter to false in their yarn-site.xml (Stack Overflow solution).

I also added this parameter but nothing changed.

The yarn-site parameters of my cluster:

    yarnProperties.put("yarn.scheduler.maximum-allocation-mb", 10240);
    yarnProperties.put("yarn.nodemanager.resource.memory-mb", 10240);
    yarnProperties.put("yarn.nodemanager.vmem-check-enabled", "false");
    yarnProperties.put("yarn.nodemanager.pmem-check-enabled", "false");
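On EMR, yarn-site.xml is normally not edited by hand on the nodes; these properties are applied through a configuration classification when the cluster is created. A sketch of the equivalent entry in the standard EMR configuration-classification JSON format, using the same properties as the snippet above:

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.scheduler.maximum-allocation-mb": "10240",
      "yarn.nodemanager.resource.memory-mb": "10240",
      "yarn.nodemanager.vmem-check-enabled": "false",
      "yarn.nodemanager.pmem-check-enabled": "false"
    }
  }
]
```

If the properties were added after cluster creation, the NodeManagers must be restarted (or the cluster recreated) for them to take effect.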

And this is the error I now get (exit status 137 corresponds to signal 9, SIGKILL, meaning the container process was forcibly killed):

ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_1591270643256_0002_01_000002 on host: ip-172-31-35-232.eu-west-1.compute.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137

Note: It sometimes works with only a master node.

Upvotes: 1

Views: 15583

Answers (1)

srikanth holur

Reputation: 780

Add EBS volumes to your nodes. M5 instances don't come with any instance storage (they are EBS-only).
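For context: because the m5 family is EBS-only, the nodes have no local instance store for HDFS and Spark shuffle/spill data, and running out of that space can kill containers with exit code 137. A hedged sketch of attaching EBS storage per instance group at cluster creation, in the same camelCase style as the cluster configuration above (exact key names depend on your provisioning tool; the gp2 type and 100 GiB size are illustrative values, not recommendations):

```json
"ebsConfiguration" : {
  "ebsBlockDeviceConfigs" : [ {
    "volumeSpecification" : {
      "volumeType" : "gp2",
      "sizeInGB" : 100
    },
    "volumesPerInstance" : 1
  } ]
}
```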

Upvotes: 0
