Reputation: 2104
The cluster goes into a deadlock state and stops allocating containers even when GBs of RAM and vcores are available.
This happened only when we started a lot of jobs in parallel, most of which were Oozie jobs with many forked actions.
Upvotes: 0
Views: 1518
Reputation: 2104
After a lot of searching and reading related questions and articles, we came across a property called maxAMShare for the YARN job scheduler (we are using the Fair Scheduler).
What does it mean?
The fraction of the queue's fair share (memory and vcores) that can be allotted to ApplicationMasters. Default value: 0.5 (50%). Source
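For orientation, this is roughly where the property sits in the Fair Scheduler allocation file (usually fair-scheduler.xml). This is only a minimal sketch: the queue name example_queue is hypothetical, and the value 0.5 just spells out the default.

```xml
<?xml version="1.0"?>
<allocations>
  <!-- "example_queue" is a hypothetical leaf queue, used only for illustration -->
  <queue name="example_queue">
    <!-- Fraction of this queue's fair share that ApplicationMasters may use.
         0.5 is the default, written out here explicitly. -->
    <maxAMShare>0.5</maxAMShare>
  </queue>
</allocations>
```

Note that maxAMShare can only be set on leaf queues.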
How did it cause the deadlock?
When we start multiple Oozie jobs in parallel, each Oozie job and its forked actions first need a couple of ApplicationMaster containers for the Oozie launchers, which then start the other containers that do the actual work of the action.
In our case, we were starting around 20-30 Oozie jobs in parallel, each with close to 20 forked actions. With each action requiring 2 ApplicationMasters, close to 800 containers were occupied by Oozie ApplicationMasters alone.
Because of this, we were hitting the default 50% maxAMShare limit for our user queue, and YARN would not allow new ApplicationMasters to be created to run the actual jobs.
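The numbers above can be checked with a little arithmetic. In this sketch the symbols J (parallel Oozie jobs), A (forked actions per job), k (ApplicationMasters per action), U_AM (resources already held by AMs in the queue), r_AM (the resources a new AM asks for) and F (the queue's fair share) are illustrative notation, not YARN's, and the last line is only a rough paraphrase of the check the Fair Scheduler applies.

```latex
\begin{align*}
N_{AM} &= J \cdot A \cdot k \\
N_{AM} &= 20 \cdot 20 \cdot 2 = 800  && (J = 20) \\
N_{AM} &= 30 \cdot 20 \cdot 2 = 1200 && (J = 30) \\
U_{AM} + r_{AM} &\le \text{maxAMShare} \cdot F = 0.5\,F && \text{(a new AM is admitted only if this holds)}
\end{align*}
```

Once the launchers alone push U_AM up to 0.5 F, no further AM is admitted. The launchers already running keep waiting for the jobs whose AMs cannot start, which, as far as we can tell, is the deadlock described above.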
Solution?
One quick fix would be to disable the check by setting this property to -1.0, but this is not recommended: you can again end up allocating all or most of the resources to AMs, leaving very little for the real work.
The other option (which we went ahead with) is to specify a separate queue for AMs in the Oozie configuration and then set the maxAMShare property to 1.0 for that queue. This way you can control how many resources are allocated to AMs without affecting the other jobs. Reference
<global>
  <configuration>
    <property>
      <name>oozie.launcher.mapred.job.queue.name</name>
      <value>root.users.oozie_am_queue</value>
    </property>
  </configuration>
</global>
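The matching queue definition on the YARN side could look something like the sketch below. Assumptions: it goes in the Fair Scheduler allocation file, the nesting mirrors root.users.oozie_am_queue from the Oozie configuration above (root is implicit in the file, as we understand it), and the maxResources figures are placeholders, not values from the original setup.

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="users">
    <!-- Dedicated leaf queue that receives the Oozie launcher AMs -->
    <queue name="oozie_am_queue">
      <!-- AMs may use up to 100% of this queue's fair share -->
      <maxAMShare>1.0</maxAMShare>
      <!-- Placeholder cap on the whole queue; size this for your own cluster -->
      <maxResources>40000 mb, 20 vcores</maxResources>
    </queue>
  </queue>
</allocations>
```

Because this queue holds only launcher AMs, letting AMs take its full share does no harm, and the cap on the queue itself is what bounds how much of the cluster the launchers can occupy.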
Hope this is a major time saver for people facing the same issue. There can be many other reasons for a deadlock too, which are already discussed in other questions on SO.
Upvotes: 3