KiranM

Reputation: 1323

Hadoop standalone instance exits while executing an MR job, with an ExpiredTokenRemover error in the log (after a few jobs have executed successfully)

All Hadoop/HDFS processes (every daemon listed by jps) exit and the user is thrown out of the terminal while an MR job is running, after a few jobs have completed successfully.

Error: 2016-07-23 17:56:16,258 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted

Log file: /usr/local/hadoop/logs/yarn-hduser-resourcemanager-KMUbLptp.log

2016-07-23 17:56:14,044 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1469316920580_0007_01_000002 Container Transitioned from ACQUIRED to RUNNING
2016-07-23 17:56:14,663 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: checking for deactivate of application :application_1469316920580_0007
2016-07-23 17:56:16,201 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: RECEIVED SIGNAL 15: SIGTERM
2016-07-23 17:56:16,258 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2016-07-23 17:56:16,259 INFO org.mortbay.log: Stopped [email protected]:8088
2016-07-23 17:56:16,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: RECEIVED SIGNAL 15: SIGTERM
2016-07-23 17:56:16,360 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
2016-07-23 17:56:16,361 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8032
2016-07-23 17:56:16,361 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2016-07-23 17:56:16,362 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033

The error appears only after the following line is printed in the terminal where the job was submitted:

16/07/23 17:56:13 INFO mapreduce.Job:  map 0% reduce 0%

Environment:

Ubuntu Desktop 16.04 LTS, JDK 1.8.0_92 and Hadoop 2.7.2

I think it could be some kind of timeout; everything works again if I restart my machine and start over. I would appreciate hearing from anybody who has encountered this issue.

Upvotes: 0

Views: 2194

Answers (1)

hakvroot

Reputation: 326

I have observed the exact same behavior but, "fortunately", I had a 100% reproduction rate inside my Docker containers. All Hadoop processes received a SIGTERM after a process was finished or canceled, shutting them down in an undesired (but orderly) fashion.

As this started happening after I changed the Ubuntu version of my image from 14.04 to 16.04, I decided to find out where the SIGTERMs were coming from, using systemtap with https://sourceware.org/systemtap/examples/process/sigmon.stp. It turned out that the NodeManager was starting /bin/kill, and that the resulting signal was targeting the NodeManager itself. Given that only the Ubuntu version had changed, and not the Hadoop version, I searched for known issues with kill under Ubuntu 16.04 and ended up at this bug report: https://bugs.launchpad.net/ubuntu/+source/alsa-driver/+bug/1610499. It turns out that whenever the NodeManager kills a container on Ubuntu 16.04, it can take down all the processes owned by the same user with it.
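Before reaching for systemtap, it may be worth confirming from the daemon logs that each shutdown really was triggered by a signal. The following is only a minimal Python sketch, assuming the log line format shown in the question; the function name `find_signals` and the regular expression are my own illustration, not part of Hadoop. It only tells you that a signal arrived and when, not who sent it.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
# 2016-07-23 17:56:16,201 ERROR org.apache...ResourceManager: RECEIVED SIGNAL 15: SIGTERM
# The format is assumed from the log excerpt in the question.
SIGNAL_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR "
    r"(?P<logger>\S+): RECEIVED SIGNAL (?P<num>\d+): (?P<name>SIG\w+)"
)


def find_signals(lines: Iterable[str]) -> List[Tuple[str, str, int, str]]:
    """Return (timestamp, daemon class, signal number, signal name) per hit."""
    hits = []
    for line in lines:
        match = SIGNAL_LINE.match(line)
        if match:
            # Keep only the short class name, e.g. "ResourceManager".
            daemon = match["logger"].rsplit(".", 1)[-1]
            hits.append((match["ts"], daemon, int(match["num"]), match["name"]))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_signals.py /usr/local/hadoop/logs/*.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for hit in find_signals(handle):
                print(path, *hit)
```

If every daemon log shows a SIGTERM at about the same moment a container finishes, that points to something outside Hadoop delivering the signal rather than a timeout.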

I worked around this issue by replacing /bin/kill with the version from Ubuntu 14.04. Not ideal, but easy enough to apply in a Docker setting. Alternatively, you'll have to wait for (or submit :) a fix, or consider using an older Ubuntu version (or another distro). Hope that helps!
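To check which kill binary an image actually ships, for example before and after swapping it, something along these lines might help. It is a rough Python sketch: the helper names are hypothetical, and it assumes that `/bin/kill --version` prints a string containing `procps-ng X.Y.Z` (or `procps version X.Y.Z`). It deliberately does not decide whether a given version is affected; compare the output of a 14.04 and a 16.04 image yourself.

```python
import re
import subprocess
from typing import Optional, Tuple

# Assumed output shape, e.g. "kill from procps-ng 3.3.10".
PROCPS_VERSION = re.compile(
    r"procps(?:-ng)?\s+(?:version\s+)?(\d+)\.(\d+)\.(\d+)"
)


def parse_procps_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from a procps version banner, if any."""
    match = PROCPS_VERSION.search(text)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_kill_version(path: str = "/bin/kill") -> Optional[Tuple[int, ...]]:
    """Ask the kill binary at `path` for its version; None if unavailable."""
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=5
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some implementations print the banner on stderr, so check both streams.
    return parse_procps_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print("procps kill version:", installed_kill_version())
```

A result of `None` means either that the binary is missing or that it is not a procps kill (for instance, a util-linux or busybox build).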

Upvotes: 1
