liaoyue

Reputation: 40

How to assign memory to the taskmanager properly?

Flink version: 1.13.6. I have a job with the following configuration:

-yjm 2048 -ytm 4096  
-yD taskmanager.memory.jvm-metaspace.size=256mb  
-yD state.backend=rocksdb  
-yD state.backend.incremental=true  
-yD parallelism.default=8   
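For context, here is a back-of-the-envelope sketch of how Flink 1.13 divides -ytm 4096 by default (assuming the documented defaults: managed fraction 0.4, network fraction 0.1, JVM overhead fraction 0.1, framework heap and off-heap 128mb each; the actual figures are shown on the TaskManager page of the web UI):

    Total Process Size:       4096mb  (-ytm)
      JVM Metaspace:           256mb  (set explicitly above)
      JVM Overhead:           ~410mb  (0.1 * 4096mb, clamped to [192mb, 1gb])
    Total Flink Memory:      ~3430mb
      Network:                ~343mb  (0.1 * 3430mb, clamped to [64mb, 1gb])
      Managed Memory:        ~1372mb  (0.4 * 3430mb)
      Framework Heap:          128mb
      Framework Off-Heap:      128mb
      Task Heap:             ~1459mb  (the remainder)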

The job has been restarting roughly every 8 hours since it was launched, always with the same exception:

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/x.x.x.x:39365'. This might indicate that the remote task manager was lost.

I have encountered this problem multiple times. It is usually caused by insufficient disk space or insufficient memory on the node the taskmanager runs on, so I checked the disk and doubled the memory allocated to the taskmanager, but the restarts continued. I am quite sure the problem is not with the Linux node itself, because other taskmanagers run fine on the same node.

I then checked the Flink memory model of the taskmanager in the web UI and found that Managed Memory was always at 100% usage while JVM Heap usage was quite small. After I added the configuration -yD taskmanager.memory.managed.size=4096mb, the problem was solved (a config sketch of this fix follows the questions below). Here are my questions:

  1. How can I find out exactly which part of the Flink memory model ran out of memory and caused the problem? Is there a log or something else?
  2. Although the restarts have stopped, Managed Memory is still 100% used. I checked the checkpoint directory on S3 and found that the total state size of this job is about 1.9 GB, and there is no data skew between the taskmanagers. How can a total of 64 GB of memory across all taskmanagers still not be enough for this job, and how can I find out which part of the job is using the Managed Memory?
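For reference, a minimal sketch of the fix described above, expressed as flink-conf.yaml keys instead of -yD flags (the 4096mb value mirrors the question; it has to fit within the taskmanager's total process size):

    taskmanager.memory.managed.size: 4096mb
    # Alternatively, tune the relative share instead of an absolute size
    # (the default fraction is 0.4 of total Flink memory):
    # taskmanager.memory.managed.fraction: 0.4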

Upvotes: 0

Views: 581

Answers (1)

David Anderson

Reputation: 43419

This issue is probably caused by RocksDB. RocksDB will try to use every bit of memory made available to it, and it can be overzealous and use more than it should. This was especially a problem in versions of Flink before 1.14, when the version of RocksDB that Flink uses was finally upgraded.

https://segmentfault.com/a/1190000041453444/en gives some background info.
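For what it's worth, the relevant knobs are Flink's RocksDB memory settings; a hedged sketch (not part of the original answer), assuming Flink 1.13 defaults. Note that when state.backend.rocksdb.memory.managed is true, Flink hands the entire managed-memory budget to RocksDB's block cache and write buffers, which is why the web UI reports Managed Memory as 100% used even when RocksDB is healthy:

    # Default: cap RocksDB at the taskmanager's managed memory budget.
    state.backend.rocksdb.memory.managed: true
    # Share of that budget reserved for write buffers (default 0.5):
    # state.backend.rocksdb.memory.write-buffer-ratio: 0.5
    # Or give each slot a fixed amount instead of sharing managed memory:
    # state.backend.rocksdb.memory.fixed-per-slot: 512mb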

Upvotes: 1
