KDC

Reputation: 1471

Hortonworks Nodemanager starts but then fails: Connection refused to :8042

I'm trying to solve an issue with a newly added datanode on our Hortonworks cluster. The YARN NodeManager on that node fails shortly after starting. The following error is logged:

Connection failed to http://(ipaddress):8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 166, in execute
    connection_timeout=curl_connection_timeout, kinit_timer_ms = kinit_timer_ms)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 198, in curl_krb_request
    _, curl_stdout, curl_stderr = get_user_call_output(curl_command, user=user, env=kerberos_env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl --location-trusted -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 -c /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 http://gdcdrwhdb821.dir.ucb-group.com:8042/ws/v1/node/info --connect-timeout 5 --max-time 7 1>/tmp/tmp7pZrbM 2>/tmp/tmpgM4wdg' returned 7.   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to (ipaddress):8042; Connection refused
)

This doesn't really tell me WHY the connection was refused, though, except that whatever YARN process corresponds to port 8042 isn't running:

netstat -tulpn | grep 8042

I've been looking for another NodeManager log, perhaps with more information, but cannot find anything useful under /var/log/hadoop-yarn or in the yarn.nodemanager.local-dirs / yarn.nodemanager.log-dirs directories.

Are there other places I can look for yarn nodemanager error logs? Does anyone know what could be causing this?

Edit: After re-checking I found this useful bit in /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-(ipaddress).log

2017-04-19 14:01:14,670 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(549)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
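
For anyone hunting for the same kind of startup failure, here is a minimal sketch of a log scan that pulls out FATAL and ERROR entries together with the exception lines that follow them. It assumes the log uses the timestamp layout shown in the excerpt above; the function name `fatal_entries` and the choice of levels are illustrative, not part of Hadoop or Ambari.

```python
import re

# Start of a log entry in the layout seen in the excerpt above, e.g.
# "2017-04-19 14:01:14,670 FATAL nodemanager.NodeManager ..."
# (assumption: every entry begins with this timestamp and a level).
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)")


def fatal_entries(lines, levels=("FATAL", "ERROR")):
    """Yield each matching log entry, including its continuation lines.

    Continuation lines (exception messages, stack frames) do not start
    with a timestamp, so they are attached to the preceding entry.
    """
    current = None
    for line in lines:
        line = line.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            # A new entry starts: emit the one collected so far, if any.
            if current is not None:
                yield "\n".join(current)
            current = [line] if match.group(1) in levels else None
        elif current is not None:
            current.append(line)
    if current is not None:
        yield "\n".join(current)
```

Passing an open file handle for the NodeManager log to `fatal_entries` and printing each result should surface entries like the ClassNotFoundException above without reading the whole file by hand.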

Upvotes: 3

Views: 4522

Answers (5)

Samat Kurmanov

Reputation: 42

You need to increase the timeout of the health check in the alerts.

Upvotes: 0

Danillo Gontijo

Reputation: 1

I stopped YARN in my HDP cluster, deleted the /var/log/hadoop-yarn/nodemanager/recovery-state directory, and started YARN again.

This worked for me too. I think it was a file permission problem.

Upvotes: 0

PRASANNA SARAF

Reputation: 383

Not sure if this still helps; you have probably solved it already.

You are using the external shuffle service, which runs as an auxiliary service inside the NodeManager. The NodeManager currently cannot find the shuffle service jar on its classpath.

Please add the location of the shuffle service jar to yarn.application.classpath in yarn-site.xml.
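
To find out where the shuffle service jar actually lives on the node (or whether it is present at all), something like the following sketch may help. It looks inside every .jar under a set of directories for the class named in the error. The function name `jars_containing` is illustrative, and the directories you pass in are an assumption that depends on your installation.

```python
import glob
import os
import zipfile

# Class from the ClassNotFoundException, written as a path inside a jar.
CLASS_ENTRY = "org/apache/spark/network/yarn/YarnShuffleService.class"


def jars_containing(class_entry, search_dirs):
    """Return paths of .jar files under search_dirs that contain class_entry."""
    hits = []
    for directory in search_dirs:
        # "**" with recursive=True also matches jars directly in `directory`.
        pattern = os.path.join(directory, "**", "*.jar")
        for jar_path in glob.glob(pattern, recursive=True):
            try:
                with zipfile.ZipFile(jar_path) as jar:
                    if class_entry in jar.namelist():
                        hits.append(jar_path)
            except (zipfile.BadZipFile, OSError):
                # Skip unreadable or corrupt archives rather than abort the scan.
                continue
    return hits
```

Calling `jars_containing(CLASS_ENTRY, [...])` with the directories where Spark and Hadoop are installed on the node returns the candidate jars. An empty result suggests the jar is missing on the new node rather than merely absent from the classpath.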

Upvotes: 0

Khairul

Reputation: 9

This also works on my side. Stop the YARN service only on the affected node, not the full YARN service.

Upvotes: 0

thanuja

Reputation: 556

Were you able to fix this?

I faced a similar issue today.

I stopped YARN in my HDP cluster and deleted /var/log/hadoop-yarn/nodemanager/recovery-state directory and started YARN again.

The nodemanager is running without failing now.

Upvotes: 2
