knowledge_seeker

Reputation: 937

How to make the logic of this Node.js web scraping program asynchronous?

I currently have a Node.js-based web scraper that uses the puppeteer module. While it does work, it is very slow, because I have written it to use a synchronous approach rather than an asynchronous one.

The basic logic of the program in pseudo code is as follows:

async function main():

    ......

    while true:
        for url in listOfUrls:
            await scrapeInformation(url)
            if there is a change:
                sendNotification()

The problem with this approach is that I cannot begin scraping another page until the current page has finished. I would like to start loading the next webpages so that they are ready to be scraped when their turn comes in the for loop. However, I still want to limit the number of webpages open for scraping at any one time, so that I do not run into memory errors; I hit that issue in a previous implementation of this script, where I was launching Chromium instances much faster than the program was able to close them.

The scrapeInformation() function looks a bit like this:

async function scrapeInformation(url, browser) {
    // The browser instance is passed in and reused, rather than launched anew on every call
    const page = await browser.newPage();
    let response = await page.goto(url);

    let data = await page.evaluate(() => {

        // blah blah blah

        return {blah, blah};
    });

    await page.close();

    return data;
}

I believe a good place to start would perhaps be to begin scraping another URL around the let data = await page.evaluate(() => { line, but I am unsure how to implement such logic.
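
To make that concrete, the kind of structure I am imagining is roughly the following (an untested sketch, not working code; concurrencyLimit is just a placeholder value):

// Untested sketch: process the URL list in batches so that at most
// concurrencyLimit pages are being scraped at the same time.
const concurrencyLimit = 3; // placeholder value

async function mainSketch(browser, listOfUrls) {
    while (true) {
        for (let i = 0; i < listOfUrls.length; i += concurrencyLimit) {
            const batch = listOfUrls.slice(i, i + concurrencyLimit);
            // Start the whole batch at once and wait for all of it to finish.
            const results = await Promise.all(
                batch.map(url => scrapeInformation(url, browser))
            );
            for (const data of results) {
                // if there is a change:
                //     sendNotification()
            }
        }
    }
}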

Upvotes: 1

Views: 364

Answers (1)

vsemozhebuty

Reputation: 13782

If I understand correctly, you need to check a set of URLs infinitely, round after round, with limited concurrency. You do not need to open and close browsers for this; that has unneeded overhead. Just create a pool of n pages (where n = the concurrency limit) and reuse them for each portion of URLs. You can shift a portion of URLs off the front and push it to the end of the set for an infinite cycle. For example:

'use strict';

const puppeteer = require('puppeteer');

const urls = [
  'https://example.org/?test=1',
  'https://example.org/?test=2',
  'https://example.org/?test=3',
  'https://example.org/?test=4',
  'https://example.org/?test=5',
  'https://example.org/?test=6',
  'https://example.org/?test=7',
  'https://example.org/?test=8',
  'https://example.org/?test=9',
  'https://example.org/?test=10',
];
const concurrencyLimit = 3;
const restartAfterNCycles = 5;

(async function main() {
  for (;;) await cycles();
})();

async function cycles() {
  try {
    // One browser per restart cycle, reused for all pages and all URLs.
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null });

    // Open enough extra pages to reach the concurrency limit,
    // then collect them all into a reusable page pool.
    await Promise.all(Array.from(
      Array(concurrencyLimit - 1), // Because one page is already opened.
      () => browser.newPage()
    ));
    const pagePool = await browser.pages();

    let cycleCounter = restartAfterNCycles;
    while (cycleCounter--) {
      const cycleUrls = urls.slice();
      let urlsPart;
      // Take the next batch of up to concurrencyLimit URLs.
      while ((urlsPart = cycleUrls.splice(0, concurrencyLimit)).length) {
        console.log(`\nProcessing concurrently:\n${urlsPart.join('\n')}\n`);
        await Promise.all(urlsPart.map((url, i) => scrape(pagePool[i], url)));
      }
      console.log(`\nCycles to do: ${cycleCounter}`);
    }

    return browser.close();
  } catch (err) {
    console.error(err);
  }
}

// Navigate one page from the pool to the URL and extract data from it.
async function scrape(page, url) {
  await page.goto(url);
  const data = await page.evaluate(() => document.location.href);
  console.log(`${data} done.`);
}
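
If you also want the change check and sendNotification() from your pseudocode, scrape could compare the fresh data with what it saw in the previous cycle. A rough sketch along those lines (previousData and the notification call are placeholders for your own logic, not part of the code above):

// Sketch only: remember the last value seen for each URL and
// notify when a freshly scraped value differs from it.
const previousData = new Map(); // url -> last scraped value

async function scrapeAndNotify(page, url) {
  await page.goto(url);
  const data = await page.evaluate(() => document.location.href);

  if (previousData.has(url) && previousData.get(url) !== data) {
    sendNotification(url, data); // your own notification function
  }
  previousData.set(url, data);
}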

Upvotes: 1
