jordi

Reputation: 1187

yarn java process not killed

I have installed Apache Samza, which uses YARN to manage its jobs. It is running on two Debian servers that are virtual machines. Samza is version 0.9.1 and Hadoop is version 2.6.0. I am seeing two different problems. I am not sure whether they are related, but in both cases YARN does not seem to be doing what it should.

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>kfk-samza01</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>3</value>
  </property>
</configuration>

In the job options file I have added the following:

yarn.container.memory.mb=256
yarn.am.container.memory.mb=256

task.opts= -Xms128M -Xmx128M

When the jobs are running, I can see that the -Xms128M -Xmx128M options are ignored and the default values are used instead.

I have also seen the following error. It looks like some memory limit is preventing jobs from going from ACCEPTED to RUNNING, but I cannot find how to solve it.

Container [pid=23007,containerID=container_1443454508386_0003_01_000001] is running beyond virtual memory limits. Current usage: 13.9 MB of 256 MB physical memory used; 1.1 GB of 537.6 MB virtual memory used. Killing container

The jobs themselves are just plain functions, so none of my code should be introducing noise.

Any idea what the problem is?

UPDATE: After staying in the ACCEPTED state for about 10 minutes, the job goes to FAILED. Here is part of what I see in the yarn-root-resourcemanager-kfk-samza01.out log:

2015-09-30 14:08:07,000 INFO  [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root  OPERATION=AM Allocated Container     TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1443613686881_0001    CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:08:07,000 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(153)) - Assigned container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-09-30 14:08:07,001 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:assignContainer(1580)) - assignedContainer application attempt=appattempt_1443613686881_0001_000002 container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 clusterResource=<memory:16384, vCores:16>
2015-09-30 14:08:07,002 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainersToChildQueues(559)) - Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>, usedCapacity=0.0625, absoluteUsedCapacity=0.0625, numApps=1, numContainers=1
2015-09-30 14:08:07,002 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainers(424)) - assignedContainer queue=root usedCapacity=0.0625 absoluteUsedCapacity=0.0625 used=<memory:1024, vCores:1> cluster=<memory:16384, vCores:16>
2015-09-30 14:08:07,005 INFO  [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : kfk-samza01:44816 for container : container_1443613686881_0001_02_000001
2015-09-30 14:08:07,009 INFO  [AsyncDispatcher event handler] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2015-09-30 14:08:07,009 INFO  [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,010 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(1830)) - Storing attempt: AppId: application_1443613686881_0001 AttemptId: appattempt_1443613686881_0001_000002 MasterContainer: Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ]
2015-09-30 14:08:07,010 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING
2015-09-30 14:08:07,011 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED
2015-09-30 14:08:07,012 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:run(253)) - Launching masterappattempt_1443613686881_0001_000002
2015-09-30 14:08:07,018 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(106)) - Setting up container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,019 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:createAMContainerLaunchContext(191)) - Command to launch container container_1443613686881_0001_02_000001 : export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec ./__package/bin/run-am.sh 1>logs/stdout 2>logs/stderr
2015-09-30 14:08:07,020 INFO  [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,020 INFO  [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,064 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(127)) - Done launching container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,065 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED to LAUNCHED
2015-09-30 14:08:08,001 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING
2015-09-30 14:21:26,930 INFO  [Ping Checker] util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(127)) - Expired:appattempt_1443613686881_0001_000002 Timed out after 600 secs
2015-09-30 14:21:26,931 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1125)) - Updating application attempt appattempt_1443613686881_0001_000002 with final state: FAILED, and exit status: -1000
2015-09-30 14:21:26,931 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from LAUNCHED to FINAL_SAVING
2015-09-30 14:21:26,932 INFO  [AsyncDispatcher event handler] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(677)) - Unregistering app attempt : appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,932 INFO  [AsyncDispatcher event handler] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,933 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,933 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(1208)) - The number of failed attempts is 2. The max attempts is 2
2015-09-30 14:21:26,935 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(995)) - Updating application application_1443613686881_0001 with final state: FAILED
2015-09-30 14:21:26,937 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from ACCEPTED to FINAL_SAVING
2015-09-30 14:21:26,938 INFO  [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(790)) - Application Attempt appattempt_1443613686881_0001_000002 is done. finalState=FAILED
2015-09-30 14:21:26,938 INFO  [AsyncDispatcher event handler] recovery.RMStateStore (RMStateStore.java:transition(161)) - Updating info for app: application_1443613686881_0001
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from RUNNING to KILLED
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(113)) - Completed container: container_1443613686881_0001_02_000001 in state: KILLED event:KILL
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root  OPERATION=AM Released Container      TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1443613686881_0001    CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:21:26,940 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(216)) - Released container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:8> available, release resources=true
2015-09-30 14:21:26,940 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(945)) - Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.
2015-09-30 14:21:26,940 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:releaseResource(1732)) - default used=<memory:0, vCores:0> numContainers=0 user=root user-resources=<memory:0, vCores:0>
2015-09-30 14:21:26,943 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:completedContainer(1683)) - completedContainer container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,943 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(604)) - completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,944 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(622)) - Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0
2015-09-30 14:21:26,944 INFO  [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1274)) - Application attempt appattempt_1443613686881_0001_000002 released container container_1443613686881_0001_02_000001 on node: host: kfk-samza01:44816 #containers=0 available=8192 used=0 with event: KILL
2015-09-30 14:21:26,945 INFO  [ResourceManager Event Processor] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(115)) - Application application_1443613686881_0001 requests cleared
2015-09-30 14:21:26,945 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(682)) - Application removed - appId: application_1443613686881_0001 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2015-09-30 14:21:26,946 INFO  [pool-1-thread-4] amlauncher.AMLauncher (AMLauncher.java:run(267)) - Cleaning master appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,948 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,949 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:removeApplication(372)) - Application removed - appId: application_1443613686881_0001 user: root leaf-queue of parent: root #applications: 0
2015-09-30 14:21:26,951 WARN  [AsyncDispatcher event handler] resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(263)) - USER=root    OPERATION=Application Finished - Failed      TARGET=RMAppManager     RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED       PERMISSIONS=Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.  APPID=application_1443613686881_0001
2015-09-30 14:21:26,955 INFO  [AsyncDispatcher event handler] resourcemanager.RMAppManager$ApplicationSummary (RMAppManager.java:logAppSummary(179)) - appId=application_1443613686881_0001,name=flow.Router_1,user=root,queue=default,state=FAILED,trackingUrl=http://kfk-samza01:8088/cluster/app/application_1443613686881_0001,appMasterHost=N/A,startTime=1443614243319,finishTime=1443615686935,finalStatus=FAILED

Any clue what is happening?

Upvotes: 0


Answers (2)

jordi

Reputation: 1187

In the end I had two problems in parallel. One was the memory limits, which has been resolved as hserus has kindly explained.

The other was a communication problem with the Kafka servers that corrupted the topics, so the jobs were unable to run.

Upvotes: 1

suresiva

Reputation: 3173

Please try the job configuration properties below to limit the container memory allocation.

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb

Both of these properties can be set to 256 (the unit is MB) in your case.

Also configure the following two properties:

mapreduce.map.java.opts
mapreduce.reduce.java.opts

The heap size in these two properties should be 128 MB in your case, passed as a JVM option such as -Xmx128m.

[Note: the heap size given in the two *.java.opts values must be a little lower than the respective *.memory.mb properties.]
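As a rough sketch, the four properties above could be written in Hadoop's XML property format as follows. Where they go (mapred-site.xml or the per-job configuration) is an assumption here, and the values simply mirror the 256 MB container and 128 MB heap figures from the question:

```xml
<!-- Sketch only: file placement and values are assumptions based on the question -->
<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
  </property>
  <property>
    <!-- JVM options string; the heap stays below the container size above -->
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx128m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx128m</value>
  </property>
</configuration>
```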

If you still get the virtual memory error, try raising the virtual-to-physical memory ratio by configuring the property below.

yarn.nodemanager.vmem-pmem-ratio

The default is 2.1. A container may use that many times its physical memory allocation as virtual memory, so a higher value raises the virtual memory limit; a lower value would make the limit stricter.
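For reference, the limit in the error message is consistent with this ratio: 256 MB of physical memory × 2.1 = 537.6 MB of virtual memory, while the container was using 1.1 GB. A minimal sketch of how the property might be set in yarn-site.xml follows. The value 4 is only an illustrative assumption, not a recommendation, and the NodeManager would most likely need a restart to pick it up:

```xml
<!-- Sketch only: the value 4 is illustrative; 256 MB x 4 = 1024 MB of virtual memory -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
```

A larger ratio gives each container a larger virtual memory allowance relative to its physical allocation. The check itself is governed by yarn.nodemanager.vmem-check-enabled (true by default), which may be worth looking at if adjusting the ratio is not enough.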

Once these properties are set correctly, the containers should be cleared upon successful completion.

Hope this helps.

Upvotes: 1
