Arson

Reputation: 193

Using Promise.all to achieve sort of multithreading with Puppeteer

I would like to get feedback on my way of thinking.

When writing scraping bots, or bots that perform certain activities on websites (using Puppeteer), I often need 'sort of' multithreading functionality: the ability to access a number of pages at the same time and, preferably, perform actions on all of them at the same time as well.

For this purpose, I use Promise.all() following this pattern:

const runInParallel = async (len) => {
    // create an array with one element per task to be performed at the
    // same time; these could also be URLs if I happen to know them beforehand
    const iterations = [...Array(len).keys()];

    // map each element to a promise; Promise.all runs them in parallel
    return Promise.all(
        iterations.map(async (i) => {
            try {
                // use Puppeteer to access a page, get data or perform certain actions
                await scrape();
            } catch (e) {
                // handle the error
            } finally {
                // close the page and browser
            }
        })
    );
};

Normally, the above is wrapped in yet another loop. Each iteration awaits all the promises to be resolved or rejected before the next iteration starts. This way I can access a number of pages at the same time, wait for all the actions to be completed on all of them, and move on to the next iteration, where the process is repeated, as in the sketch below.
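For reference, a minimal sketch of that outer loop (runAllBatches is an illustrative name, not part of my actual code):

const runAllBatches = async (total, batchSize) => {
    // process `total` tasks in fixed-size batches; each batch must
    // fully settle before the next one starts
    for (let done = 0; done < total; done += batchSize) {
        await runInParallel(Math.min(batchSize, total - done));
    }
};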

I am wondering what the drawbacks of this approach are, and whether there is a better alternative for accessing a number of pages at the same time to either scrape data or perform certain actions on them.

Upvotes: 5

Views: 875

Answers (2)

ggorlen

Reputation: 56855

I often need 'sort of' multithreading functionality

This may sound pedantic, but Node is single-threaded, so there's no multithreading happening when you use Puppeteer like this. The browsers run in separate processes, so we're multiprocessing, spreading tasks across multiple workers that each run as separate applications.

I am wondering what the drawbacks of this approach are

The drawbacks are the same as any parallel work: if you're running more processes than logical cores, there's contention overhead. Browsers are huge memory hogs and if you run your system out of memory you'll start thrashing. A great way to bring your machine to its knees is to launch a few dozen browsers at once with Promise.all().

[my Promise.all() batch] is wrapped in yet another loop where each iteration awaits all the promises to be resolved/rejected before it starts the next iteration; this way I can access a number of pages at the same time, await all the actions to be completed on all of them and move to the next iteration where the process is repeated.

Say you've determined your machine can accept a maximum concurrency of 3 browsers at once. If you're using a Promise.all() pattern that runs 3 requests at a time, then each chunk of 3 will have to wait for the slowest of the 3 to finish before spawning the next chunk of 3.

The solution is a parallel task queue: a structure that accepts n tasks and runs at most k of them at once. Let's say we have 5 tasks and can tolerate 3 at once. We'll fire off the same initial batch of 3 as in your current strategy, but task 4 can start as soon as the fastest task in the first batch completes, and task 5 can start as soon as the next running task finishes while it's waiting at the head of the queue.

For example, consider tasks numbered a-e, with task times {a: 7, b: 2, c: 3, d: 5, e: 3}, which are not known in advance. With Promise.all() and batches of 3, we have a total time of 12 time units:

123456789abc
------------
aaaaaaaddddd
bb     eee
ccc

With a task queue, the total time is 7 time units:

1234567
-------
aaaaaaa
bbddddd
ccceee
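Here's a minimal sketch of such a queue (assuming `tasks` is an array of async functions; error handling is omitted, and in practice a library like p-limit handles the details):

const runQueue = async (tasks, k) => {
    // run `tasks` with at most `k` in flight at once; each worker pulls
    // the next task as soon as its current one settles, so one slow task
    // never blocks the rest of a batch
    let next = 0;
    const worker = async () => {
        while (next < tasks.length) {
            await tasks[next++](); // claim and run the next task
        }
    };
    // start up to k workers that drain the queue in parallel
    await Promise.all(
        Array.from({length: Math.min(k, tasks.length)}, worker)
    );
};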

As mentioned in this answer, you may not need a new browser for every task. Using a page per task in one browser and blocking any site resources you don't need (stylesheets, images, scripts, etc.) should provide a large speed increase. For example:
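Here's a rough sketch of blocking resource types with Puppeteer's request interception (adjust the blocked list to the site you're scraping):

const blockResources = async (page) => {
    // abort requests for resource types the task doesn't need
    await page.setRequestInterception(true);
    page.on('request', (req) => {
        const blocked = ['image', 'stylesheet', 'font'];
        if (blocked.includes(req.resourceType())) {
            req.abort();
        } else {
            req.continue();
        }
    });
};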

Check out puppeteer-cluster for an off-the-shelf solution that offers page-based task queuing.
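A minimal usage sketch along the lines of the library's README (the URLs are placeholders):

const { Cluster } = require('puppeteer-cluster');

(async () => {
    // one browser, with up to 3 pages working through the queue
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 3,
    });

    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        // ...scrape the page here
    });

    ['https://example.com/a', 'https://example.com/b'].forEach(url => {
        cluster.queue(url);
    });

    await cluster.idle();
    await cluster.close();
})();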

Upvotes: 1

Tamás Sallai

Reputation: 3365

This approach works, and it does real multithreading, not just "sort of". Puppeteer launches Chromium in separate processes, and they all run concurrently.

Watch out for memory issues, though. I experimented with a similar strategy and found that starting 2-3 browsers consumed all the memory on my machine and everything ground to a halt. You might want to limit how many scraping jobs you run in parallel. Here is an article I wrote about how to solve this.

You can also optimize this workflow by using only one browser instance but multiple pages. That way, each scrape call can open a new tab, do the scraping, then close it, and multiple scrape jobs can share the same browser, as in the sketch below.
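A minimal sketch of that idea (assuming `urls` holds the pages to visit; the actual scraping logic goes where indicated):

const puppeteer = require('puppeteer');

const scrapeAll = async (urls) => {
    // one shared browser for all jobs
    const browser = await puppeteer.launch();
    const scrapeOne = async (url) => {
        const page = await browser.newPage(); // new tab per job
        try {
            await page.goto(url);
            // ...do the scraping here
        } finally {
            await page.close();
        }
    };
    try {
        await Promise.all(urls.map(scrapeOne));
    } finally {
        await browser.close();
    }
};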

Upvotes: 2
