Reputation: 31
I am trying to get a working HBase cluster: two masters and two region servers. My problem is that the region servers complain about not being able to tell the master that they are up:
2016-07-01 16:10:21,879 WARN [regionserver/nbd-hadoop-data1/153.77.130.27:60020] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2016-07-01 16:10:24,879 INFO [regionserver/nbd-hadoop-data1/153.77.130.27:60020] regionserver.HRegionServer: reportForDuty to master=0.0.0.0,60000,1467381897236 with port=60020, startcode=1467382178755
2016-07-01 16:10:24,879 DEBUG [regionserver/nbd-hadoop-data1/153.77.130.27:60020] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2016-07-01 16:10:24,880 DEBUG [regionserver/nbd-hadoop-data1/153.77.130.27:60020] ipc.AbstractRpcClient: Connecting to /0.0.0.0:60000
2016-07-01 16:10:24,880 WARN [regionserver/nbd-hadoop-data1/153.77.130.27:60020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.net.ConnectException: Connection refused
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2270)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:894)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
The strange thing is that it opens the port on 0.0.0.0.
The master server is waiting for the region servers to check in:
2016-07-01 16:08:43,495 INFO [0.0.0.0:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 220970 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
But when I stop the region server, the master (via ZooKeeper) recognises that the region server went offline:
2016-07-01 16:55:25,124 WARN [main-EventThread] zookeeper.RegionServerTracker: nbd-hadoop-data1,60020,1467384161702 is not online or isn't known to the master.The latter could be caused by a DNS misconfiguration.
2016-07-01 16:55:26,509 INFO [0.0.0.0:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 3023984 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
My HBase cluster configuration is:
153.77.130.29 nbd-hadoop-nn1 - zookeeper, hdfs, hbase master
153.77.130.30 nbd-hadoop-nn2 - zookeeper, hdfs, hbase master
153.77.130.22 nbd-service - zookeeper
153.77.130.27 nbd-hadoop-data1 - hbase regionserver 1
153.77.130.28 nbd-hadoop-data2 - hbase regionserver 2
All machines have **/etc/hosts** set up in the following way:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 nbd-hadoop-nn1
153.77.130.22 nbd-service
153.77.130.29 nbd-hadoop-nn1
153.77.130.30 nbd-hadoop-nn2
153.77.130.27 nbd-hadoop-data1
153.77.130.28 nbd-hadoop-data2
Master server hbase-site.xml:
<property>
  <name>hbase.master.port</name>
  <value>60000</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.38</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value>
</property>
<property>
  <name>hbase.regionserver.info.port</name>
  <value>60030</value>
</property>
<property>
  <name>hbase.regionserver.port</name>
  <value>60020</value>
</property>
Region server hbase-site.xml:
<property>
  <name>hbase.master.info.port</name>
  <value>60010</value>
</property>
<property>
  <name>hbase.master.port</name>
  <value>60000</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.38</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value>
</property>
<property>
  <name>hbase.regionserver.port</name>
  <value>60020</value>
</property>
<property>
  <name>hbase.regionserver.info.port</name>
  <value>60030</value>
</property>
netstat -ntlp on the master server nbd-hadoop-nn1 correctly shows port 60000 open on ::: (all interfaces):
tcp 0 0 :::60000 :::* LISTEN 30839/java
netstat -ntlp on the region server nbd-hadoop-data1 shows that port 60020 is bound to localhost, which I think is the root of the issue:
tcp 0 0 ::ffff:127.0.0.1:60020 :::* LISTEN 22858/java
I am not able to telnet to the region server's port 60020 from the master server; telnet nbd-hadoop-data1 60020 gives connection refused. This is probably the root of the problem, but I don't know how to reconfigure it. I couldn't find an explanation anywhere for why the region server opens its port on ::ffff:127.0.0.1:60020.
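To see whether name resolution on the data node itself explains this, a quick check (standard Linux tools, hostnames from my setup) would be something like:
# on nbd-hadoop-data1: how does the node resolve its own name?
hostname -f
getent hosts "$(hostname -f)"
# and which address is the region server RPC port actually bound to?
netstat -ntlp | grep 60020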
Many thanks for your tips. If you need additional logs or configuration files, I will provide them.
Upvotes: 3
Views: 3931
Reputation: 1
I also had the exact same problem!
In your case, the entry 127.0.0.1 nbd-hadoop-nn1 resolves the hostname to localhost. Apparently HBase/ZooKeeper needs to know the actual IP address in distributed mode. I don't know the HBase internals, but if you remove this entry it will work like a charm!
I have my own DNS server, so specifying the hostname is sufficient for me and I don't need to use the /etc/hosts file at all. In fact, I had this issue because all my machines in the cluster had 127.0.0.1 localhost machine<n> entries in their /etc/hosts files! So thanks to @miky I knew exactly where to look to resolve this issue. My machine provisioning sets up the /etc/hosts file with a hostname entry, and since I introduced a DNS server in my network recently, it's time to lose this practice.
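For reference, a minimal sketch of what /etc/hosts could look like on each node after removing the offending line (only localhost stays on the loopback addresses, and every cluster hostname maps to its real IP; addresses taken from the question):
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
153.77.130.22 nbd-service
153.77.130.29 nbd-hadoop-nn1
153.77.130.30 nbd-hadoop-nn2
153.77.130.27 nbd-hadoop-data1
153.77.130.28 nbd-hadoop-data2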
Upvotes: 0
Reputation: 31
The issue is solved. The problem was caused by the loopback entries in my /etc/hosts files (127.0.0.1 hostname).
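For completeness, after fixing /etc/hosts and restarting HBase, the same checks from the question should confirm the fix:
# on nbd-hadoop-data1: the RPC port should no longer be bound to 127.0.0.1
netstat -ntlp | grep 60020
# from the master: the region server port should now be reachable
telnet nbd-hadoop-data1 60020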
Upvotes: 0