joss

Reputation: 765

Flink yarn-session mode becomes unstable when running ~10 batch jobs at the same time

I am trying to set up a Flink YARN session to run 100+ batch jobs. After roughly 40 task managers have connected and ~10 jobs are running (each task manager with 2 slots and 1 GB of memory), the session becomes unstable, even though there are enough resources available. The Flink UI suddenly becomes unavailable, so I suspect the job manager has died. Eventually, the YARN application gets killed as well.

The job manager is running on a 4-core, 16 GB node with 12 GB available.
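For reference, the session is launched roughly like the sketch below, using the standard yarn-session.sh flags to match the sizes above (the -jm value here is only illustrative, not a measured requirement):

    # Sketch: start a detached Flink YARN session sized as described above.
    # -n = number of TaskManager containers, -s = slots per TaskManager,
    # -tm / -jm = TaskManager / JobManager container memory in MB.
    ./bin/yarn-session.sh -d -n 40 -s 2 -tm 1024 -jm 4096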

Is there any guide for doing the math on job manager resources versus the number of task managers it can handle?

Upvotes: 0

Views: 350

Answers (1)

joss

Reputation: 765

I got this fixed. The reason the Flink session was breaking was the low network bandwidth of the worker machines in the cluster. The worker machines that run the task manager containers should have at least 750 Mbps of bandwidth. With each task manager having 2 slots and 1 GB of memory, a moderate bandwidth of ~450 Mbps won't cut it. If the jobs are I/O intensive, communication between actors (job manager to workers, or worker to worker) can time out (controlled by Flink's akka.ask.timeout setting).

I decided not to increase the ask timeout, so that jobs wouldn't simply run longer because of this bottleneck.
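For anyone who would rather raise the timeouts instead, here is a sketch of how they could be passed as dynamic properties when starting the session (key names taken from the Flink configuration docs; the values are only examples):

    # Sketch: raise the actor ask/lookup timeouts via dynamic properties
    # instead of accepting slower, bandwidth-bound communication.
    ./bin/yarn-session.sh -d -n 40 -s 2 -tm 1024 \
      -Dakka.ask.timeout="60 s" \
      -Dakka.lookup.timeout="60 s"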

Upvotes: 1
