Hugo

Reputation: 21

Hadoop HA ERROR: Exception in doCheckpoint (IOException) Exception during image upload doCheckpoint

I am running Hadoop 3.2.2 on a Windows 10 based cluster with high availability configured for HDFS using the Quorum Journal Manager.

The system works just fine and I am able to transition nodes between the active and standby states without issues, but I often get the following error message:

java.io.IOException: Exception during image upload
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:315)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:502)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error writing request body to server
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:295)
    ... 6 more
Caused by: java.io.IOException: Error writing request body to server
    at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3597)
    at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3580)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:377)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:321)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:295)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:230)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

My cluster setup is the following:

A: Namenode, Zookeeper, ZKFC, Journal

B: Namenode, Zookeeper, ZKFC, Journal

C: Namenode, Zookeeper, ZKFC

D: Journal, Datanode

E, F, G, ...: Datanode

Here is my hdfs-site.xml configuration:

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
    <description>Logical name for this new nameservice</description>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>A,B,C</value>
    <description>Unique identifiers for each NameNode in the nameservice</description>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.A</name>
    <value>A:8020</value>
    <description>RPC address for NameNode 1; it is necessary to use the real host name of the machine instead of an alias</description>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.B</name>
    <value>B:8020</value>
    <description>RPC address for NameNode 2</description>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.C</name>
    <value>C:8020</value>
    <description>RPC address for NameNode 3</description>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.A</name>
    <value>A:9870</value>
    <description>HTTP address for NameNode 1</description>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.B</name>
    <value>B:9870</value>
    <description>HTTP address for NameNode 2</description>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.C</name>
    <value>C:9870</value>
    <description>HTTP address for NameNode 3</description>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://A:8485;B:8485;D:8485/mycluster</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>shell(C:/mylocation/stop-namenode.bat $target_host)</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>C:/hadoop-3.2.2/data/journal</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>A:2181,B:2181,C:2181</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/hadoop-3.2.2/data/dfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/hadoop-3.2.2/data/dfs/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.safemode.threshold-pct</name>
    <value>0.5f</value>
  </property>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>

Has anyone encountered the same issue? Am I missing something here?

Upvotes: 1

Views: 603

Answers (1)

Ramesh K

Reputation: 1

Not sure if this issue has been resolved. It may be caused by this change: https://issues.apache.org/jira/browse/HADOOP-16886. The solution would be to set a suitable value for hadoop.http.idle_timeout.ms in core-site.xml.
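For example, the timeout could be raised in core-site.xml as sketched below. The 180000 ms value is only an illustrative assumption; the idea is to make it longer than the time the standby NameNode needs to upload its fsimage.

<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <value>180000</value>
  <description>Idle timeout for Hadoop HTTP servers; 180000 ms is an example value, not a recommendation</description>
</property>

The NameNodes have to be restarted for the new value to take effect.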

Upvotes: 0
