Reputation: 937
I currently have a Node.js based web scraper that utilizes the puppeteer module. While it does work, it is very slow, since I have made it in such a way that it uses a synchronous approach instead of an asynchronous one.
The basic logic of the program in pseudo code is as follows:
async function main():
    ......
    while true:
        for url in listOfUrls:
            await scrapeInformation()
            if there is a change:
                sendNotification()
The problem with this approach is that I cannot begin scraping another page until the current page has been scraped. I would like to begin loading the next web pages so that they are ready to be scraped once their turn comes in the for loop. However, I still want to limit the number of web pages open for scraping, so that I do not run into any memory errors; I ran into that issue in a previous implementation of this script, where I was launching instances of the Chromium browser much faster than the program was able to close them.
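In other words, I want something that behaves roughly like this (a sketch of the desired behavior, not working code; maxOpenPages is a made-up cap):

// Desired behavior, roughly: at most maxOpenPages pages in flight at once
const maxOpenPages = 5; // made-up limit
for (let i = 0; i < listOfUrls.length; i += maxOpenPages) {
    const batch = listOfUrls.slice(i, i + maxOpenPages);
    await Promise.all(batch.map(url => scrapeInformation(url)));
}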
The scrapeInformation() function looks a bit like this:
async function scrapeInformation(url) {
    const browser = await puppeteer.launch({headless: true});
    const page = await browser.newPage();
    let response = await page.goto(url);
    let data = await page.evaluate(() => {
        // blah blah blah
        return {blah, blah};
    });
    await page.close();
    await browser.close();
    return data;
}
I believe a good place to start would perhaps be to begin scraping another URL at the let data = await page.evaluate(() => {
line, but I am unsure how to implement such logic.
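Conceptually, I imagine something like the following (a rough, untested sketch; nextPage and nextUrl are placeholders for a second tab and the upcoming URL):

// Start the next navigation without awaiting it...
const nextNavigation = nextPage.goto(nextUrl);
// ...so it loads in the background while the current page is evaluated.
let data = await page.evaluate(() => {
    // blah blah blah
});
// By the time the current page is done, the next one should be ready.
await nextNavigation;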
Upvotes: 1
Views: 364
Reputation: 13782
If I understand correctly, you need to check some set of URLs infinitely, round after round, with limited concurrency. You do not need to open and close browsers for this: it has unneeded overhead. Just create a pool with n pages (where n = the concurrency limit) and reuse them with a portion of the URLs at a time. You can shift such a portion off the front of the set and push it back onto the end for an infinite cycle. For example:
'use strict';

const puppeteer = require('puppeteer');

const urls = [
    'https://example.org/?test=1',
    'https://example.org/?test=2',
    'https://example.org/?test=3',
    'https://example.org/?test=4',
    'https://example.org/?test=5',
    'https://example.org/?test=6',
    'https://example.org/?test=7',
    'https://example.org/?test=8',
    'https://example.org/?test=9',
    'https://example.org/?test=10',
];

const concurrencyLimit = 3;
const restartAfterNCycles = 5;

(async function main() {
    for (;;) await cycles();
})();

async function cycles() {
    try {
        const browser = await puppeteer.launch({ headless: false, defaultViewport: null });

        // Fill the pool up to the concurrency limit.
        await Promise.all(Array.from(
            Array(concurrencyLimit - 1), // Because one page is already opened.
            () => browser.newPage()
        ));
        const pagePool = await browser.pages();

        let cycleCounter = restartAfterNCycles;
        while (cycleCounter--) {
            const cycleUrls = urls.slice();
            let urlsPart;
            // Take the URLs in chunks of concurrencyLimit until the copy is exhausted.
            while ((urlsPart = cycleUrls.splice(0, concurrencyLimit)).length) {
                console.log(`\nProcessing concurrently:\n${urlsPart.join('\n')}\n`);
                await Promise.all(urlsPart.map((url, i) => scrape(pagePool[i], url)));
            }
            console.log(`\nCycles to do: ${cycleCounter}`);
        }

        // Restart the browser from time to time to mitigate memory growth.
        return browser.close();
    } catch (err) {
        console.error(err);
    }
}

async function scrape(page, url) {
    await page.goto(url);
    const data = await page.evaluate(() => document.location.href);
    console.log(`${data} done.`);
}
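If you also need the change detection from the question, one way (a minimal sketch: lastSeen is a made-up cache, and sendNotification is the question's placeholder) is to have the scrape function return its data and compare it with the previous run:

const lastSeen = new Map(); // url -> data from the previous cycle

async function scrapeAndCompare(page, url) {
    await page.goto(url);
    const data = await page.evaluate(() => document.location.href); // extract real data here
    if (lastSeen.has(url) && lastSeen.get(url) !== data) {
        console.log(`${url} changed.`); // call sendNotification() here
    }
    lastSeen.set(url, data);
}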
Upvotes: 1