Reputation: 1471
I'm trying to solve an issue with a newly added datanode on our Hortonworks cluster. The YARN NodeManager on that node fails shortly after starting. The following error is returned:
Connection failed to http://(ipaddress):8042/ws/v1/node/info (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 166, in execute
connection_timeout=curl_connection_timeout, kinit_timer_ms = kinit_timer_ms)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 198, in curl_krb_request
_, curl_stdout, curl_stderr = get_user_call_output(curl_command, user=user, env=kerberos_env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl --location-trusted -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 -c /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 http://gdcdrwhdb821.dir.ucb-group.com:8042/ws/v1/node/info --connect-timeout 5 --max-time 7 1>/tmp/tmp7pZrbM 2>/tmp/tmpgM4wdg' returned 7. % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to (ipaddress):8042; Connection refused
)
This doesn't really tell me WHY the connection was refused, though, only that whatever YARN process corresponds to port 8042 isn't running:
netstat -tulpn | grep 8042
I've been looking for another NodeManager log with more information, but cannot find anything useful under /var/log/hadoop-yarn or in the yarn.nodemanager.local-dirs / yarn.nodemanager.log-dirs directories.
Are there other places I can look for YARN NodeManager error logs? Does anyone know what could be causing this?
Edit: After re-checking, I found this useful bit in /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-(ipaddress).log:
2017-04-19 14:01:14,670 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(549)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
Upvotes: 3
Views: 4522
Reputation: 1
I stopped YARN in my HDP cluster and deleted /var/log/hadoop-yarn/nodemanager/recovery-state directory and started YARN again.
This worked for me too. I think it was a file permission problem.
Upvotes: 0
Reputation: 383
Not sure if this still helps; you have probably solved it already.
You are using the external shuffle service, which runs as an auxiliary service inside the NodeManager. At the moment the NodeManager cannot find the shuffle service jar on its classpath, which is why it fails with the ClassNotFoundException for org.apache.spark.network.yarn.YarnShuffleService.
Please add the location of the shuffle service jar to yarn.application.classpath in yarn-site.xml.
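If it is not obvious where that jar lives on the node, something like the following may help to locate it. This is only a sketch: the helper name find_jars_with_class and the search root /usr/hdp are my own assumptions, not something taken from your setup. It walks a directory tree and reports every jar that contains the class named in the exception.

```python
import os
import zipfile


def find_jars_with_class(root, class_name):
    """Return the sorted paths of .jar files under root that contain class_name."""
    # A class a.b.C is stored inside a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        matches.append(path)
            except (zipfile.BadZipFile, OSError):
                # Skip unreadable files, corrupt archives and dangling symlinks
                continue
    return sorted(matches)


if __name__ == "__main__":
    # "/usr/hdp" is an assumed install root; adjust it for your cluster
    wanted = "org.apache.spark.network.yarn.YarnShuffleService"
    for jar_path in find_jars_with_class("/usr/hdp", wanted):
        print(jar_path)
```

If it prints nothing, the jar is probably not installed on the new node at all, which would also explain why only that node fails.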
Upvotes: 0
Reputation: 9
This also works fine on my side. Stop the YARN service only on the affected node, not the full YARN service.
Upvotes: 0
Reputation: 556
Were you able to fix this?
I faced a similar issue today.
I stopped YARN in my HDP cluster and deleted /var/log/hadoop-yarn/nodemanager/recovery-state directory and started YARN again.
The nodemanager is running without failing now.
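If you would rather not delete the directory outright, a more cautious variant is to rename it while YARN is stopped, so it can be restored later. This is just a sketch of that idea: renaming instead of deleting is my own assumption (I deleted it), and set_aside is a hypothetical helper name.

```python
import os
import time


def set_aside(path):
    """Rename path to <path>.bak-<timestamp>; return the new name, or None if path is absent."""
    path = os.path.normpath(path)  # drops any trailing slash
    if not os.path.exists(path):
        return None
    backup = "{}.bak-{}".format(path, time.strftime("%Y%m%d%H%M%S"))
    os.rename(path, backup)
    return backup


if __name__ == "__main__":
    # Run this only while the NodeManager is stopped
    print(set_aside("/var/log/hadoop-yarn/nodemanager/recovery-state"))
```

Once the NodeManager has been running cleanly for a while, the renamed copy can be removed.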
Upvotes: 2