puppeteer-cluster: Setting a timeout on individual execution tasks

Question

I'm trying to get individual tasks to throw a time-out during stress testing to see what my calling program will do. However, my cluster keeps tasks fresh indefinitely. It appears to queue all my cluster.execute calls which then are kept in memory and return their results to listeners that have long since disconnected.

The docs state:

timeout Specify a timeout for all tasks. Defaults to 30000 (30 seconds).

My cluster launch configuration:

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 1,
    timeout: 1000 //milliseconds
});

I'm calling the queuing mechanism using:

const pdf = await cluster.execute(html, makePdf);

Where makePdf is an async function that expects a HTML string, fills a page with it and prints a PDF using the default puppeteer.

const makePdf = async ({ page, data: html, worker }) => {
    await page.setContent(html);
    let pdf = await page.pdf({});
    console.log('worker ' + worker.id + ' task ' + count);
    return pdf;
};

I sort of expected the queue to start emptying itself until it found a task that didn't exceed its timeout value. I've tried setting the timeout to 1 ms but this doesn't trigger a timeout either. I've tried moving this code to a cluster.task as described in the examples to see if that would trigger the setting, but no such luck. How do I get already queued requests to time out? Does this even work if I'm not scraping websites or connecting to anything?

I'm considering to pass a timestamp along with my tasks so it can skip doing anything for requests that have expired on the calling side, but I'd rather use built-in options wherever possible.

EDIT:

Thanks to Thomas's clarification I've decided to build this little optimization to prevent tasks where the listeners are long gone from executing.

Swap the content of data from just html with a json that has both the url and timestamp:

let timestamp = new Date();
await cluster.execute({html, timestamp});

Ignore any queued task where the listener has timed out:

const makePdf = async ({ page, data: { html, timestamp }, worker }) => {
    let time_since_call = (new Date() - timestamp);
    if (time_since_call < timeout_ms) {
        await page.setContent(html);
        let pdf = await page.pdf({});
        return pdf;
    } 
};

Thomas Dondorf · Accepted Answer

This is a misunderstanding what timeout does. The timeout option is the timeout for the task, meaning that the job itself (after leaving the queue) cannot take longer than the specified timeout. The option does not cancel a queued job that is still in the queue.

Example:

const cluster = await Cluster.launch({
    // ...
    maxConcurrency: 1,
    timeout: 1000 // one second
});
// ...
for (let i = 0; i < 10; i += 1) {
    cluster.queue('...');
}

This code adds 10 jobs and runs them sequentially (as maxConcurrency is 1). There is no different between queue and execute here (see this question for more information on this topic). So what happens is the following:

First job starts running
First job is interrupted after one second
Second job starts running
Second job is interrupted after one second
...

The use case you are describing is currently not supported by the library (btw, disclaimer: I'm the author), but as you proposed, you could add a timestamp to the object you are queuing and cancel the job right away if it is too far in the past.

puppeteer-cluster: Setting a timeout on individual execution tasks

Answers (1)

Related Questions