Reputation: 23
I would appreciate it if you could help me.
While implementing Spark Streaming from Kafka to HBase (code is attached), we have run into the error "java.io.IOException: Connection reset by peer" (full log is attached).
The issue only appears when we write to HBase with dynamic allocation enabled in the Spark settings. If we write the data to HDFS (a Hive table) instead of HBase, or if dynamic allocation is turned off, no errors occur.
We have tried changing the ZooKeeper connections, the Spark executor idle timeout, and the network timeout. We have also tried switching the shuffle block transfer service (NIO), but the error is still there. If we set the min/max number of executors for dynamic allocation to fewer than 80, there are no problems either.
What could the problem be? There are many nearly identical issues in Jira and on Stack Overflow, but nothing has helped.
Versions:
HBase 1.2.0-cdh5.14.0
Kafka 3.0.0-1.3.0.0.p0.40
SPARK2 2.2.0.cloudera2-1.cdh5.12.0.p0.232957
hbase-client/hbase-spark(org.apache.hbase) 1.2.0-cdh5.11.1
Spark settings:
--num-executors=80
--conf spark.sql.shuffle.partitions=200
--conf spark.driver.memory=32g
--conf spark.executor.memory=32g
--conf spark.executor.cores=4
Cluster: 1+8 nodes, 70 CPUs, 755 GB RAM, 10x HDD
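For context, the job itself follows the usual direct-stream pattern. A simplified, hypothetical sketch (not the actual attached code, assuming the hbase-spark HBaseContext.streamBulkPut API and placeholder broker, topic, table, and column names) looks roughly like this:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hbase"), Seconds(30))

    // Placeholder Kafka settings; assumes keyed String messages.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafka-to-hbase",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Write each micro-batch to HBase; table, column family and qualifier are placeholders.
    val hbaseContext = new HBaseContext(ssc.sparkContext, HBaseConfiguration.create())
    hbaseContext.streamBulkPut[(String, String)](
      stream.map(r => (r.key, r.value)),
      TableName.valueOf("events"),
      { case (k, v) =>
        new Put(Bytes.toBytes(k))
          .addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(v))
      })

    ssc.start()
    ssc.awaitTermination()
  }
}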
Log:
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 717 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 717 successfully in removeExecutor
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 717 has been removed (new total is 26)
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 705.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 705 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 705 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 705 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(705, lang32.ca.sbrf.ru, 22805, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 705 has been removed (new total is 25)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 705 successfully in removeExecutor
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 716.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 716 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 716 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 716 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(716, lang32.ca.sbrf.ru, 28678, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 716 has been removed (new total is 24)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 716 successfully in removeExecutor
18/04/09 13:51:56 WARN server.TransportChannelHandler: Exception in connection from /10.116.173.65:57542
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
18/04/09 13:51:56 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.116.173.65:57542 is closed
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 548.
Upvotes: 1
Views: 4946
Reputation: 7366
Try setting these two parameters:
spark.network.timeout
spark.executor.heartbeatInterval
Also try caching the DataFrame before writing to HBase.
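For example (the values below are illustrative only, not tuned recommendations; the heartbeat interval must stay well below the network timeout):
--conf spark.network.timeout=800s
--conf spark.executor.heartbeatInterval=60s
For the caching, calling persist() on the DataFrame before the HBase write and unpersist() afterwards avoids recomputing its lineage during the write.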
Upvotes: 1
Reputation: 11
Please see my related answer here: What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark
It also took me a while to understand why Cloudera states the following:
Dynamic allocation and Spark Streaming
If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
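In practice that means submitting the streaming job with
--conf spark.dynamicAllocation.enabled=false
which is the same property named in the quote above.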
Upvotes: 0