Reputation: 1
Requirement: The Flink cluster (including the JobManager and TaskManagers) needs to operate continuously 24/7 to ensure that Flink jobs can be submitted and run without interruption.
Issue: The JVM Metaspace of both the JobManager and the TaskManagers keeps growing with every Flink job execution. Once Metaspace usage reaches roughly 95%, the cluster becomes unresponsive and Flink jobs start to fail. I tested this by submitting a simple WordCount job thousands of times; a job like that should leave no classes or classloaders behind in Metaspace, yet usage still keeps increasing.
I would appreciate any suggestions for potential solutions to this problem.
The only workaround currently available is to frequently restart the Flink cluster, but this is not the solution I am seeking.
Upvotes: 0
Views: 182
Reputation: 76547
You might consider using the built-in Flame Graph, which can help you dig into where your job is spending the bulk of its time, in case something in your job graph is putting pressure on the CPU or on allocated memory. If you are using Flink 1.19+, you can also use the async-profiler integration in the Flink UI to profile your running job and identify these kinds of issues.
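As a rough illustration, both views are typically disabled by default and gated behind configuration flags. The snippet below is a minimal sketch of what enabling them might look like in flink-conf.yaml; the option names are my assumption of the standard settings, so verify them against the documentation for your Flink version.

```yaml
# Minimal sketch -- assumed option names, verify for your Flink version.

# Enable the Flame Graph view in the web UI (disabled by default).
rest.flamegraph.enabled: true

# Flink 1.19+: enable the async-profiler based profiling view in the web UI.
rest.profiling.enabled: true
```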
Some other considerations:
If your job relies on any expensive or stateful components (for example clients, connections, or caches), make sure they are initialized in the open() function. Initialization outside of this, such as in the processElement() function, could cause the component to be initialized on every element, which could also explain these types of issues (a short sketch of this pattern follows at the end of this answer).

Without knowing the specific details of your application, it's hard to say exactly what the problem could be.
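To make that concrete, here is a minimal sketch of the open() pattern using a ProcessFunction; HeavyClient is a hypothetical placeholder for whatever expensive dependency your job actually uses, not a real API.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class EnrichFunction extends ProcessFunction<String, String> {

    // Hypothetical stand-in for an expensive/stateful dependency (client, connection, cache, ...).
    private transient HeavyClient client;

    @Override
    public void open(Configuration parameters) {
        // Initialize once per parallel task instance, when the operator starts.
        client = new HeavyClient();
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        // Do NOT create the component here: processElement() runs for every record,
        // so per-element initialization piles up resources (and can pin classes).
        out.collect(client.enrich(value));
    }

    @Override
    public void close() {
        // Release the dependency when the task shuts down.
        if (client != null) {
            client.close();
        }
    }

    // Placeholder type for illustration only.
    private static class HeavyClient {
        String enrich(String value) { return value; }
        void close() {}
    }
}
```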
Upvotes: 0