pratpor

Reputation: 2104

YARN jobs getting stuck in ACCEPTED state despite memory available

The cluster goes into a deadlock state and stops allocating containers even when GBs of RAM and vcores are available.

This was happening only when we started a lot of jobs in parallel, most of which were Oozie jobs with many forked actions.

Upvotes: 0

Views: 1518

Answers (1)

pratpor

Reputation: 2104

After a lot of searching and reading related questions and articles, we came across a property called maxAMShare for the YARN job scheduler (we are using the Fair Scheduler).

What does it mean?

The percentage of memory and vcores from the user's queue share that can be allotted to Application Masters. Default value: 0.5 (50%). Source
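For context, this property is set per queue in the Fair Scheduler allocation file (fair-scheduler.xml). A minimal sketch of where it lives, with the queue name being only an example:

<?xml version="1.0"?>
<allocations>
    <!-- Default AM share for queues that do not set it explicitly -->
    <queueMaxAMShareDefault>0.5</queueMaxAMShareDefault>

    <queue name="users">
        <!-- At most 50% of this queue's fair share can go to ApplicationMasters -->
        <maxAMShare>0.5</maxAMShare>
    </queue>
</allocations>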

How it caused the deadlock?

When we start multiple Oozie jobs in parallel, each Oozie job and its forked actions first require a couple of ApplicationMaster containers to be allotted for the Oozie launchers, which then start the other containers that do the actual action task.

In our case, we were actually starting around 20-30 Oozie jobs in parallel, each with close to 20 forked actions. With each action requiring 2 ApplicationMasters, that is roughly 20 × 20 × 2 ≈ 800 containers blocked by the Oozie ApplicationMasters alone.

Because of this, we were hitting the default 50% maxAMShare limit for our user queue, and YARN would not allow new ApplicationMasters to be created to run the actual jobs.

Solution?

  1. One quick suggestion could be to disable the check by setting this property to -1.0. But this is not recommended: you could again end up allocating all or most of the resources to AMs, and very little actual work would get done.

  2. The other option (which we went ahead with) is to specify a separate queue for AMs in the Oozie configuration and then set the maxAMShare property for that queue to 1.0. This way you can control how much of the resources can be allocated to AMs without affecting the other jobs. Reference

<global>
    <configuration>
        <property>
            <name>oozie.launcher.mapred.job.queue.name</name>
            <value>root.users.oozie_am_queue</value>
        </property>
    </configuration>
</global>

(Screenshot: Dynamic Resource Pool Configuration)
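If you maintain fair-scheduler.xml by hand instead of using a UI like the Dynamic Resource Pool Configuration page above, the allocation for the dedicated AM queue could look roughly like this (queue names and values are illustrative, chosen to match the Oozie configuration shown earlier):

<?xml version="1.0"?>
<allocations>
    <queue name="users">
        <!-- Dedicated queue for Oozie launcher / ApplicationMaster containers -->
        <queue name="oozie_am_queue">
            <!-- Let ApplicationMasters use this queue's full share -->
            <maxAMShare>1.0</maxAMShare>
        </queue>
    </queue>
</allocations>

With this in place, the launcher AMs are confined to root.users.oozie_am_queue, while the queues running the actual work keep the default maxAMShare.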

Hope this will be a major time saver for people facing the same issue. There could be many other reasons for deadlock too, which are already discussed in other questions on SO.

Upvotes: 3
