Simon

Reputation: 133

Bull.js jobs stalling despite timeout being set

I have a Bull queue running lengthy video upload jobs, which can take anywhere from under a minute to many minutes.

The jobs stall after the default 30 seconds, so I increased the timeout to several minutes, but the longer timeout is not respected. If I set the timeout to 10 ms the job immediately stalls, so the timeout option is being taken into account.

Job {
  opts: {
    attempts: 1,
    timeout: 600000,
    delay: 0,
    timestamp: 1634753060062,
    backoff: undefined
  },
  ...
}
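
For reference, this is roughly how such a job would be enqueued to get the options above (a sketch; the queue name and payload are placeholders, not taken from the question):

const Queue = require('bull');

const videoQueue = new Queue('video-upload');

videoQueue.add(
  { videoId: 'abc123' },            // placeholder payload
  { attempts: 1, timeout: 600000 }  // 10-minute timeout, matching the opts dump above
);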

Despite the timeout, I am receiving a stalled event, and the job starts to process again.

EDIT: I thought "stalling" was the same as timing out, but apparently there is a separate interval that controls how often Bull checks for stalled jobs. In other words, the real problem is why jobs are considered "stalled" even though they are busy performing an upload.

Upvotes: 3

Views: 6377

Answers (2)

Ashish Agarwal

Reputation: 21

A better approach is to use the job.progress() function: have the long-running task update its progress at regular intervals, so the work is broken into smaller steps and the event loop is never blocked long enough for the job to be marked as stalled.

https://github.com/OptimalBits/bull/blob/HEAD/REFERENCE.md#jobprogress
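
For illustration, a minimal sketch of such a processor (the chunking and upload helpers are assumptions, not part of the question):

const Queue = require('bull');

const videoQueue = new Queue('video-upload');

videoQueue.process(async (job) => {
  const chunks = splitIntoChunks(job.data);   // placeholder: prepare the upload pieces
  for (let i = 0; i < chunks.length; i++) {
    await uploadChunk(chunks[i]);             // placeholder async upload step
    // Report progress as a percentage after each awaited step.
    await job.progress(Math.round(((i + 1) / chunks.length) * 100));
  }
});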

You can also listen for the stalled event and log it to help with troubleshooting:

queue.on('stalled', function (job) {
  // A job has been marked as stalled. This is useful for debugging job
  // workers that crash or pause the event loop.
  console.warn(`Job ${job.id} stalled and will be restarted`);
});

Upvotes: 2

alikh31

Reputation: 49

The problem seems to be that your job is stalling because the operation you are running blocks the event loop. You could convert your code into non-blocking code and solve the problem that way.
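
For example, one way to do that (a sketch using Node's worker_threads; the transcode-worker.js script and message shape are assumptions for illustration) is to move any CPU-heavy step into a worker thread so the processor's event loop stays free to renew the job lock:

const { Worker } = require('worker_threads');

function runInWorker(jobData) {
  return new Promise((resolve, reject) => {
    // 'transcode-worker.js' is a placeholder script that does the heavy work.
    const worker = new Worker('./transcode-worker.js', { workerData: jobData });
    worker.on('message', resolve);
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
    });
  });
}

queue.process(async (job) => {
  // The processor only awaits here; the event loop is not blocked by the heavy work.
  return runInWorker(job.data);
});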

That being said, the stalled-check interval can also be configured in the queue settings when creating the queue (more of a quick fix):

const queue = new Bull('queue', {
  redis: {
    port: 6379,
    host: 'localhost',
    db: 0,
  },
  settings: {
    // default is 30000 ms (30 s); set 0 to disable the stalled check entirely
    stalledInterval: 60 * 60 * 1000, // 1 hour
  },
});

Based on Bull's docs:

  • timeout: The number of milliseconds after which the job should fail with a timeout error
  • stalledInterval: How often to check for stalled jobs (use 0 to never check)

Increasing stalledInterval (or disabling the check by setting it to 0) relaxes or removes the check that verifies the worker's event loop is still running, so the system effectively ignores the stalled state.

Again, from the docs:

When a worker is processing a job it will keep the job "locked" so other workers can't process it.

It's important to understand how locking works to prevent your jobs from losing their lock - becoming _stalled_ -
and being restarted as a result. Locking is implemented internally by creating a lock for `lockDuration` on interval
`lockRenewTime` (which is usually half `lockDuration`). If `lockDuration` elapses before the lock can be renewed,
the job will be considered stalled and is automatically restarted; it will be __double processed__. This can happen when:
1. The Node process running your job processor unexpectedly terminates.
2. Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock (see [#488](https://github.com/OptimalBits/bull/issues/488) for how we might better detect this). You can fix this by breaking your job processor into smaller parts so that no single part can block the Node event loop. Alternatively, you can pass a larger value for the `lockDuration` setting (with the tradeoff being that it will take longer to recognize a real stalled job).

As such, you should always listen for the `stalled` event and log this to your error monitoring system, as this means your jobs are likely getting double-processed.

As a safeguard so problematic jobs won't get restarted indefinitely (e.g. if the job processor always crashes its Node process), jobs will be recovered from a stalled state a maximum of `maxStalledCount` times (default: `1`).
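
Per the docs above, the alternative to disabling the stalled check is to pass a larger lockDuration. A sketch (assuming a 10-minute lock is long enough for your uploads):

const Bull = require('bull');

// Keep the stalled check, but give the worker a longer lock.
// Trade-off: a genuinely stalled job takes longer to be recognized.
const queue = new Bull('queue', {
  redis: { port: 6379, host: 'localhost', db: 0 },
  settings: {
    lockDuration: 10 * 60 * 1000, // 10 minutes; lockRenewTime defaults to half of this
  },
});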

Upvotes: 2
