Reputation: 16416
We're currently running Hortonworks 2.6.5.0:
$ hadoop version
Hadoop 2.7.3.2.6.5.0-292
Subversion git@github.com:hortonworks/hadoop.git -r 3091053c59a62c82d82c9f778c48bde5ef0a89a1
Compiled by jenkins on 2018-05-11T07:53Z
Compiled with protoc 2.5.0
From source with checksum abed71da5bc89062f6f6711179f2058
This command was run using /usr/hdp/2.6.5.0-292/hadoop/hadoop-common-2.7.3.2.6.5.0-292.jar
The OS is CentOS 7:
$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
We recently started noticing these issues in the ambari-agent's log file:
$ grep -iE "error|warn" /var/log/ambari-agent/*
/var/log/ambari-agent/ambari-agent.log:WARNING 2018-07-30 14:03:50,982 NetUtil.py:124 - Server at https://hbase26-2.mydom.com:8440 is not reachable, sleeping for 10 seconds...
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:00,986 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:00,990 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.
/var/log/ambari-agent/ambari-agent.log:WARNING 2018-07-30 14:04:00,990 NetUtil.py:124 - Server at https://hbase26-2.aa.mydom.com:8440 is not reachable, sleeping for 10 seconds...
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:10,993 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:10,994 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.
/var/log/ambari-agent/ambari-agent.log:WARNING 2018-07-30 14:04:10,994 NetUtil.py:124 - Server at https://hbase26-2.aa.mydom.com:8440 is not reachable, sleeping for 10 seconds...
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:20,996 NetUtil.py:96 - EOF occurred in violation of protocol (_ssl.c:579)
/var/log/ambari-agent/ambari-agent.log:ERROR 2018-07-30 14:04:20,997 NetUtil.py:97 - SSLError: Failed to connect. Please check openssl library versions.
When these started occurring we could no longer manage any aspects of the Hadoop cluster through Ambari. All the services showed little yellow question marks and said "heartbeat lost".
Multiple restarts did not bring Ambari back, so we could not regain control of our cluster.
Upvotes: 2
Views: 3596
Reputation: 16416
This issue turned out to be a TLS protocol version problem: the connection to the CA service on port 8440 was being attempted with TLSv1.1, which the server apparently could not deal with.
We noticed that the service was in fact running:
$ netstat -tapn|grep 8440
tcp 0 0 0.0.0.0:8440 0.0.0.0:* LISTEN 1203/java
But curl requests to this port would fail unless we disabled TLS checks via the --insecure switch. This was our first clue that the problem was related to TLS.
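For anyone retracing this, one way to narrow down a TLS clue like that is to probe which protocol versions the port will actually negotiate. The following is only a minimal sketch, assuming Python 3.7+ on the machine doing the probing; make_context and probe are hypothetical helper names, not part of Ambari, and certificate checks are skipped in the same spirit as curl --insecure.

```python
import socket
import ssl


def make_context(version):
    """Build a client context pinned to exactly one TLS version."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Skip certificate checks (like curl --insecure); only the protocol is tested.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


def probe(host, port, version, timeout=5.0):
    """Return the negotiated protocol name, or None on any failure.

    None covers both "handshake rejected" and "host unreachable", so
    confirm the port is listening (e.g. with netstat) before reading
    too much into it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with make_context(version).wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()
    except (ssl.SSLError, OSError, ValueError):
        # ValueError: the local OpenSSL build does not support this version.
        return None
```

Calling probe(host, 8440, ssl.TLSVersion.TLSv1_1) and then the same with ssl.TLSVersion.TLSv1_2 against your own Ambari server should show whether only one of the two versions gets through. Note that a modern client OpenSSL may itself refuse TLSv1.1, which also shows up as None.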
Further investigation led us to NetUtil.py (part of Ambari), which seemed OK. Other leads included:
$ cat /etc/ambari-agent/conf/ambari-agent.ini
...
[security]
ssl_verify_cert = 0
...
And this:
$ grep -E '\[https|verify' /etc/python/cert-verification.cfg
[https]
#verify=platform_default
verify=disable
Neither of these worked. What ultimately did work was forcing ambari-agent to use TLSv1.2 instead of TLSv1.1:
$ grep -E "\[security|force" /etc/ambari-agent/conf/ambari-agent.ini
[security]
force_https_protocol=PROTOCOL_TLSv1_2
And then restarting the agent with ambari-agent restart.
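If you have many agents to check, something like the following could confirm whether a given ambari-agent.ini already carries the setting. It is a rough sketch, assuming the file parses as a standard INI with Python 3's configparser; forced_protocol is a hypothetical helper name, and the sample text just mirrors the [security] lines shown above.

```python
import configparser


def forced_protocol(ini_text):
    """Return the force_https_protocol value from [security], or None if unset."""
    # interpolation=None keeps any literal '%' in other values from raising;
    # strict=False tolerates duplicate sections or keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("security", "force_https_protocol", fallback=None)


sample = """
[security]
ssl_verify_cert = 0
force_https_protocol=PROTOCOL_TLSv1_2
"""

print(forced_protocol(sample))  # prints PROTOCOL_TLSv1_2
```

On a real host you would pass in the contents of /etc/ambari-agent/conf/ambari-agent.ini; a result of None means the agent has not been pinned to a protocol.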
I was able to piece this all together using wisps of hints scattered all over the Internet. I'm putting this here in the hopes it will help any other poor souls that have this happen to their Hadoop/Hortonworks cluster.
While digging further I found a thread titled: Disabling TLSv1 & TLS1.1 - Enabling TLSv1.2. It's apparently now mandatory to configure your Ambari Agents to use TLSv1.2.
Upvotes: 10