joss

Reputation: 765

Flink yarn-session mode becomes unstable when running ~10 batch jobs at the same time

I am trying to set up a Flink YARN session to run 100+ batch jobs. After roughly 40 task managers have connected and ~10 jobs are running (each task manager with 2 slots and 1 GB of memory), the session becomes unstable, even though there are enough resources available. The Flink UI suddenly becomes unavailable, so I suspect the job manager has died. Eventually, the YARN application gets killed as well.

The job manager is running on a 4-core, 16 GB node with 12 GB available.
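For reference, the session is launched roughly like the sketch below, using the standard yarn-session.sh flags to match the sizes above (the -jm value here is only illustrative, not a measured requirement):

    # Sketch: start a detached Flink YARN session sized as described above.
    # -n = number of TaskManager containers, -s = slots per TaskManager,
    # -tm / -jm = TaskManager / JobManager container memory in MB.
    ./bin/yarn-session.sh -d -n 40 -s 2 -tm 1024 -jm 4096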

Is there any guide for doing the math on job manager resources versus the number of task managers it can handle?

Upvotes: 0

Views: 350

Answers (1)

joss

Reputation: 765

I got this fixed. The reason the Flink session was breaking was the low network bandwidth of the worker machines in the cluster. The worker machines that run the task manager containers should have at least 750 Mbps of bandwidth. With each task manager having 2 slots and 1 GB of memory, a moderate bandwidth of ~450 Mbps won't cut it. If the jobs are I/O intensive, communication between actors (job manager to workers, or worker to worker) can time out (controlled by Flink's akka.ask.timeout setting).

I decided not to increase the ask timeout, so that jobs wouldn't simply run longer because of this bottleneck.
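For anyone who would rather raise the timeouts instead, here is a sketch of how they could be passed as dynamic properties when starting the session (key names taken from the Flink configuration docs; the values are only examples):

    # Sketch: raise the actor ask/lookup timeouts via dynamic properties
    # instead of accepting slower, bandwidth-bound communication.
    ./bin/yarn-session.sh -d -n 40 -s 2 -tm 1024 \
      -Dakka.ask.timeout="60 s" \
      -Dakka.lookup.timeout="60 s"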

Upvotes: 1
