Marc Rasmussen

Reputation: 20555

Puppeteer crawling above 8000 sub pages

So I am attempting to crawl webshops to get a specific ID (the so-called EAN ID) for all of their products. To do this I am using Puppeteer.

I created the following function:

async function asyncForEach(array, callback) {
    for (let index = 0; index < array.length; index++) {
        console.log("Iterating through array " + index +  " Of " +array.length);

        await callback(array[index], index, array)
    }
}

Then I created the following script:

await asyncForEach(productsToview, async function (productPage, index, arr) {
    if (productPage.indexOf("url") >= 0) {
        await page.goto(productPage);
        await page.waitForSelector('#site-wrapper');
        await page.click('#product-read-more-specs');
        await page.click('#tab-specs-trigger');
        const productToSave = await page.evaluate(() => {
            const $ = window.$;
            let product = {
                title: $('.product-title').text(),
                EAN: $('.spec-section').last().find('td').last().text(),
                price: $('.product-price-container').text().replace(/\s/g, '')
            };
            return product;
        });
        resultArray.push(productToSave);
    }
});

console.log(resultArray);

Now this actually works; however, it is incredibly slow. Each page takes roughly 3-5 seconds, and since I have 8000 pages, the whole run takes around 10 hours to complete.

So my question is: is there a faster way to do this when dealing with this many pages?

Upvotes: 0

Views: 959

Answers (1)

Md. Abu Taher

Reputation: 18826

Subjective solution: use multiple tabs/pages and split the whole list into 10 or so parts. It will put more strain on CPU, network and other resources, but the scraping should be faster. Not to mention the website may well flag you as spam for browsing 8000 pages in a short time.

To get this working, you will need several different pieces (see the sketch after this list):

  • A helper to chunk the array into several pieces.
  • A new tab for each chunk, each with its own Promise returning the result.
  • Finally, a database to store all data asynchronously (preferred), or Promise.all() to return the results once everything is finished.
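
Below is a minimal sketch of how those pieces could fit together. It is not the answer's exact solution: it reuses the selectors and the productsToview array from the question, assumes a jQuery-backed page as in the original page.evaluate call, and the chunk helper, the scrapeAll function and the concurrency of 10 tabs are illustrative choices, not an established API.

const puppeteer = require('puppeteer');

// Split an array into chunks of at most `size` elements.
function chunk(array, size) {
    const chunks = [];
    for (let i = 0; i < array.length; i += size) {
        chunks.push(array.slice(i, i + size));
    }
    return chunks;
}

async function scrapeAll(productsToview, concurrency = 10) {
    const browser = await puppeteer.launch();

    // One chunk per tab: `concurrency` tabs work through their chunks in parallel.
    const chunkSize = Math.ceil(productsToview.length / concurrency);
    const chunks = chunk(productsToview, chunkSize);

    const perTabResults = await Promise.all(chunks.map(async (urls) => {
        const page = await browser.newPage();
        const results = [];
        for (const productPage of urls) {
            if (productPage.indexOf("url") >= 0) {
                await page.goto(productPage);
                await page.waitForSelector('#site-wrapper');
                await page.click('#product-read-more-specs');
                await page.click('#tab-specs-trigger');
                const product = await page.evaluate(() => {
                    const $ = window.$;
                    return {
                        title: $('.product-title').text(),
                        EAN: $('.spec-section').last().find('td').last().text(),
                        price: $('.product-price-container').text().replace(/\s/g, '')
                    };
                });
                results.push(product);
            }
        }
        await page.close();
        return results;
    }));

    await browser.close();
    // Flatten the per-tab arrays into one result list.
    return [].concat(...perTabResults);
}

With 10 tabs, the same 8000 pages should finish in roughly a tenth of the wall-clock time, provided your CPU, memory and network can keep up and the site does not start blocking you.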

This is subjective, and I cannot share a full step-by-step solution with code right now, but if you put the approach above into action, it should be enough.

Upvotes: 1
