Miky Mouse

Reputation: 31

hbase regionservers are not communicating with master

I am trying to get a working HBase cluster: two masters and two region servers. My problem is that the region servers cannot tell the master that they are up:

2016-07-01 16:10:21,879 WARN  [regionserver/nbd-hadoop-data1/153.77.130.27:60020] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2016-07-01 16:10:24,879 INFO  [regionserver/nbd-hadoop-data1/153.77.130.27:60020] regionserver.HRegionServer: reportForDuty to master=0.0.0.0,60000,1467381897236 with port=60020, startcode=1467382178755
2016-07-01 16:10:24,879 DEBUG [regionserver/nbd-hadoop-data1/153.77.130.27:60020] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2016-07-01 16:10:24,880 DEBUG [regionserver/nbd-hadoop-data1/153.77.130.27:60020] ipc.AbstractRpcClient: Connecting to /0.0.0.0:60000
2016-07-01 16:10:24,880 WARN  [regionserver/nbd-hadoop-data1/153.77.130.27:60020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.net.ConnectException: Connection refused
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2270)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:894)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)

The strange thing is that it tries to reach the master at 0.0.0.0.

The master server is meanwhile waiting for region servers to check in:

2016-07-01 16:08:43,495 INFO  [0.0.0.0:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 220970 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.

But when I stop the region server, the master (via ZooKeeper) notices that the region server went offline:

2016-07-01 16:55:25,124 WARN  [main-EventThread] zookeeper.RegionServerTracker: nbd-hadoop-data1,60020,1467384161702 is not online or isn't known to the master.The latter could be caused by a DNS misconfiguration.
2016-07-01 16:55:26,509 INFO  [0.0.0.0:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 3023984 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
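
As far as I can tell, the region server takes the master address from ZooKeeper, so the master=0.0.0.0 in the log above is presumably what the active master registered there. It can be checked with the bundled ZooKeeper client (assuming the default /hbase parent znode):

hbase zkcli
get /hbase/master    # the registered master hostname should be readable in the (partly binary) output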

My HBase cluster configuration is:

153.77.130.29 nbd-hadoop-nn1 - zookeeper, hdfs, hbase master
153.77.130.30 nbd-hadoop-nn2 - zookeeper, hdfs, hbase master
153.77.130.22 nbd-service - zookeeper
153.77.130.27 nbd-hadoop-data1 - hbase regionserver 1
153.77.130.28 nbd-hadoop-data2 - hbase regionserver 2


All machines have **/etc/hosts** set up in the following way:

127.0.0.1       localhost       localhost.localdomain localhost4 localhost4.localdomain4
::1     localhost       localhost.localdomain localhost6 localhost6.localdomain6

127.0.0.1       nbd-hadoop-nn1
153.77.130.22 nbd-service
153.77.130.29 nbd-hadoop-nn1
153.77.130.30 nbd-hadoop-nn2
153.77.130.27 nbd-hadoop-data1
153.77.130.28 nbd-hadoop-data2

Master server hbase-site.xml:

<property>
  <name>hbase.master.port</name>
  <value>60000</value>
</property>

<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.38</value>
</property>

<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value>
</property>

<property>
  <name>hbase.regionserver.info.port</name>
  <value>60030</value>
</property>

<property>
  <name>hbase.regionserver.port</name>
  <value>60020</value>
</property>

Region server hbase-site.xml:

<property>
  <name>hbase.master.info.port</name>
  <value>60010</value>
</property>

<property>
  <name>hbase.master.port</name>
  <value>60000</value>
</property>

<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.38</value>
</property>

<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value>
</property>

<property>
  <name>hbase.regionserver.port</name>
  <value>60020</value>
</property>

<property>
  <name>hbase.regionserver.info.port</name>
  <value>60030</value>
</property>

netstat -ntlp from the master server nbd-hadoop-nn1 (it correctly shows port 60000 open on :::):

tcp        0      0 :::60000                    :::*                        LISTEN      30839/java

netstat -ntlp from the region server nbd-hadoop-data1 shows that port 60020 is bound to localhost, which I think is the root of the issue:

tcp        0      0 ::ffff:127.0.0.1:60020      :::*                        LISTEN      22858/java

I am not able to telnet to the region server's port 60020 from the master server (telnet nbd-hadoop-data1 60020 - connection refused). This is probably the root of the problem, but I don't know how to reconfigure it. I couldn't find anywhere why the region server opens its port at ::ffff:127.0.0.1:60020.
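
One way to see why it binds there is to check how nbd-hadoop-data1 resolves its own hostname locally (assuming getent is available); as far as I understand, the region server binds its RPC port to whatever address the local hostname resolves to:

hostname -f
getent hosts nbd-hadoop-data1    # if this prints 127.0.0.1, port 60020 gets bound to loopback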

Many thanks for your tips. If you need additional logs or configuration files, I will provide them.

Upvotes: 3

Views: 3931

Answers (2)

V. Jay

Reputation: 1

I had the exact same problem!

In your case, the entry 127.0.0.1 nbd-hadoop-nn1 resolves the hostname to localhost.
Apparently HBase/ZooKeeper needs to know the actual IP address in distributed mode.

I don't know the HBase internals, but if you remove this entry it will work like a charm! I have my own DNS server, so specifying the hostname is sufficient for me and I don't need to use the /etc/hosts file at all. In fact, I had this issue because all my machines in the cluster had 127.0.0.1 localhost machine<n> entries in /etc/hosts! So thanks to @miky I knew exactly where to look to resolve this issue. My machine provisioning sets up the /etc/hosts file with a hostname entry, and I introduced a DNS server in my network recently, so it's time to lose this practice. A sketch of a cleaned-up /etc/hosts is below.
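
As a rough sketch, the /etc/hosts on each node would then keep only the standard localhost lines plus the real cluster addresses, something like this (hostnames and IPs taken from the question, adjust to your own nodes):

127.0.0.1       localhost       localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost       localhost.localdomain localhost6 localhost6.localdomain6
153.77.130.22   nbd-service
153.77.130.29   nbd-hadoop-nn1
153.77.130.30   nbd-hadoop-nn2
153.77.130.27   nbd-hadoop-data1
153.77.130.28   nbd-hadoop-data2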

Upvotes: 0

Miky Mouse

Reputation: 31

The issue is solved. The problem was caused by the loopback entries in my /etc/hosts files (127.0.0.1 hostname).
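
After removing those entries and restarting HBase, the region server port should be bound to the machine's real address instead of loopback, which netstat on the region server can confirm, e.g.:

netstat -ntlp | grep 60020    # should now show 153.77.130.27 (or :::) rather than ::ffff:127.0.0.1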

Upvotes: 0
