Reputation: 383
I have a table that looks like this. All the names in the NAME column are links that navigate to the next page.
|---------|----|
| NAME    | ID |
|---------|----|
| Name 1  | 1  |
|---------|----|
| Name 2  | 2  |
|---------|----|
| Name 3  | 3  |
|---------|----|
I am trying to follow each link, extract data from the page behind it, and then return to the table. However, there are over 4000 records in the table, and everything is processed very slowly (around 1000 ms per record).
Here is my code:
// Grabs all table rows.
const items = await page.$$(domObjects.itemPageLink);

for (let i = 0; i < items.length; i++) {
    // Go back to the table and re-query the rows, because the old
    // handles are stale after navigating away
    await page.goto(url);
    await page.waitForSelector(domObjects.itemPageLink);
    let items = await page.$$(domObjects.itemPageLink);
    const item = items[i];

    // Read the ID from the last cell of the row
    let id = await item.$eval("td:last-of-type", node => node.innerText.split(",").map(item => item.trim()));

    // Click the link in the first cell and wait for the details page
    let link = await item.$eval("td:first-of-type a", node => node.click());
    await page.waitForSelector(domObjects.itemPageWrapper);

    let itemDetailsPage = await page.$(domObjects.itemPageWrapper);
    let title = await itemDetailsPage.$eval(".page-header__title", title => title.innerText);

    console.log(title);
    console.log(id);
}
Is there a way to speed this up so I can get all the results much more quickly? I would like to use this for my API.
Upvotes: 3
Views: 2117
Reputation: 25240
There are some minor code improvements and one major improvement that can be applied here.
The minor improvements boil down to using as few puppeteer functions as possible. Most of the puppeteer functions you use send data from the Node.js environment to the browser environment via a WebSocket. While this only takes a few milliseconds, those milliseconds add up in the long run. For more information on this, you can check out this question asking about the difference between using page.evaluate and using more puppeteer functions.
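As a rough illustration of the round-trip cost (a sketch with made-up selectors, not your actual ones):
// Two puppeteer calls: two WebSocket round-trips between Node.js and the browser
const heading = await page.$eval('h1', el => el.innerText); // made-up selector
const firstLink = await page.$eval('a', el => el.href); // made-up selector

// One puppeteer call: a single round-trip doing the same work inside the browser
const result = await page.evaluate(() => ({
    heading: document.querySelector('h1').innerText,
    firstLink: document.querySelector('a').href,
}));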
This means that, to optimize your code, you can for example use querySelector inside the page instead of running item.$eval multiple times. Another optimization is to use the result of page.waitForSelector directly: the function resolves with the node when it appears, so you do not need to query it via page.$ again afterwards.
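Applied to your loop body, these two points might look roughly like this (an untested sketch reusing the selectors from your code):
// One round-trip for the whole row instead of one $eval call per cell
const rowData = await item.evaluate(row => ({
    id: row.querySelector('td:last-of-type').innerText.split(',').map(s => s.trim()),
    url: row.querySelector('td:first-of-type a').href,
}));

// waitForSelector resolves with the node, so no extra page.$ call is needed
const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);
const title = await itemDetailsPage.$eval('.page-header__title', el => el.innerText);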
These are only minor improvements, though, which might speed up the crawling slightly.
Right now, you are using one browser with a single page to crawl multiple URLs. You can improve the speed of your script by using a pool of puppeteer resources to crawl multiple URLs in parallel. puppeteer-cluster allows you to do exactly that (disclaimer: I'm the author). The library takes a task and applies it to a number of URLs in parallel.
The number of parallel instances you can use depends on your CPU, memory, and network throughput. The more you can use, the better your crawling speed will be.
Below is a minimal example adapting your code to extract the same data. The code first sets up a cluster with one browser and four pages. After that, a task function is defined, which will be executed for each of the queued objects.
After this, one page instance of the cluster is used to extract the IDs and URLs from the initial page. The function given to cluster.queue extracts the IDs and URLs from the page and calls cluster.queue again with objects of the form { id: ..., url: ... }. For each of the queued objects, the cluster.task function is executed, which then extracts the title and prints it next to the passed ID.
const { Cluster } = require('puppeteer-cluster');

(async () => {
    // Setup your cluster with 4 pages
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 4,
    });

    // Define the task for the pages (go to the URL, and extract the title)
    await cluster.task(async ({ page, data: { id, url } }) => {
        await page.goto(url);
        const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);
        const title = await itemDetailsPage.$eval('.page-header__title', title => title.innerText);
        console.log(id, title);
    });

    // Use one page of the cluster to extract the links (ignoring the task function above)
    cluster.queue(async ({ page }) => {
        await page.goto(url); // the URL of the initial page is given from outside

        // Extract the links and IDs from the initial page
        const itemData = await page.$$eval(domObjects.itemPageLink, items => items.map(item => ({
            id: item.querySelector('td:last-of-type').innerText.split(',').map(s => s.trim()),
            url: item.querySelector('td:first-of-type a').href,
        })));

        // Queue the data: { id: ..., url: ... } to start the process
        itemData.forEach(data => cluster.queue(data));
    });

    // Wait until all queued tasks are done, then shut the cluster down
    await cluster.idle();
    await cluster.close();
})();
Upvotes: 6