Reputation: 40
Flink version: 1.13.6. I have a job with the configuration below.
-yjm 2048 -ytm 4096
-yD taskmanager.memory.jvm-metaspace.size=256mb
-yD state.backend=rocksdb
-yD state.backend.incremental=true
-yD parallelism.default=8
The job has been restarting roughly every 8 hours since it was started, and the exception is always the same:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/x.x.x.x:39365'. This might indicate that the remote task manager was lost.
I have encountered this problem multiple times before. It is usually caused by insufficient disk space or memory on the node the TaskManager runs on. So I checked the disk and doubled the memory allocated to the TaskManager, but the restarts still happened. I am fairly sure the problem is not the Linux node itself, because other TaskManagers run fine on the same node.

Then I checked the Flink memory model of the TaskManager in the web UI and found that Managed Memory usage was always at 100% while JVM Heap usage was quite small. After I added -yD taskmanager.memory.managed.size=4096mb, the problem was solved. Here are my questions
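For context, here is roughly where the managed memory goes under my original flags, assuming the default Flink 1.13 memory fractions (the exact numbers shown in the web UI may differ slightly):

total process memory (-ytm)                            = 4096 MB
  - JVM metaspace                                      =  256 MB
  - JVM overhead (fraction 0.1, min 192 MB, max 1 GB)  ~  410 MB
  = total Flink memory                                 ~ 3430 MB
managed memory = 0.4 (default fraction) * 3430 MB      ~ 1372 MB

With RocksDB as the state backend, all of that managed memory is handed to RocksDB, which is why the web UI always shows Managed Memory at 100%.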
Upvotes: 0
Views: 581
Reputation: 43419
This issue is probably caused by RocksDB. RocksDB will try to use every bit of available memory, and it can be overzealous and use more than it should. This was especially a problem in Flink versions before 1.14; in 1.14 the RocksDB version that Flink embeds was finally upgraded.
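If you cannot upgrade past 1.13 right away, the usual mitigations are to give RocksDB a larger explicit managed-memory budget and to leave extra native-memory slack in the TaskManager process. A sketch using the standard Flink memory options (the sizes are only illustrative; tune them to your container size):

taskmanager.memory.managed.size: 4096m
state.backend.rocksdb.memory.managed: true
taskmanager.memory.jvm-overhead.fraction: 0.2

The first is essentially what you ended up doing (raising taskmanager.memory.managed.fraction is an equivalent alternative); the second is the default and keeps RocksDB budgeted against managed memory; the third reserves extra headroom in the process size so a moderate RocksDB overshoot does not immediately get the container killed.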
https://segmentfault.com/a/1190000041453444/en gives some background info.
Upvotes: 1