Reputation: 20555
So I am attempting to crawl webshops to get a specific id (the so-called EAN id) for all of their products. To do this I am using Puppeteer.
I created the following function:
async function asyncForEach(array, callback) {
  for (let index = 0; index < array.length; index++) {
    console.log("Iterating through array " + index + " of " + array.length);
    await callback(array[index], index, array);
  }
}
Then I created the following script:
await asyncForEach(productsToview, async function (productPage, index, arr) {
  if (productPage.indexOf("url") >= 0) {
    await page.goto(productPage);
    await page.waitForSelector('#site-wrapper');
    await page.click('#product-read-more-specs');
    await page.click('#tab-specs-trigger');
    const productToSave = await page.evaluate(() => {
      const $ = window.$;
      let product = {
        title: $('.product-title').text(),
        EAN: $('.spec-section').last().find('td').last().text(),
        price: $('.product-price-container').text().replace(/\s/g, '')
      };
      return product;
    });
    resultArray.push(productToSave);
  }
});
console.log(resultArray);
Now this actually works; however, it is incredibly slow. Each page takes roughly 3-5 seconds, and since I have 8000 pages I have to wait around 10 hours for it to complete.
So my question is: is there a faster way to do this when we are talking about this many pages?
Upvotes: 0
Views: 959
Reputation: 18826
Subjective solution: use multiple tabs/pages and split the whole list into 10 or so parts. It will put strain on CPU, network, and other resources, but the scraping should be faster. Not to mention the website may well mark you as spam for browsing 8000 pages in a short time.
To get this working, you will need several different pieces: a way to split the list into chunks, a separate tab working through each chunk, and Promise.all() to return the results once everything is finished. A rough sketch of that idea follows. It's subjective, and I cannot share the whole solution step by step with code for now, but if you put this approach into action, it should be enough.
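A minimal sketch of that approach, assuming an already launched Puppeteer browser instance and a hypothetical scrapeProduct(page, url) helper that wraps the goto/waitForSelector/click/evaluate logic from the question (the concurrency value of 10 and the helper name are illustrative, not part of the original code):

const CONCURRENCY = 10; // number of parallel tabs; tune to your machine and the site's tolerance

// Split the URL list into CONCURRENCY roughly equal chunks
function chunkArray(array, parts) {
  const chunks = Array.from({ length: parts }, () => []);
  array.forEach((item, i) => chunks[i % parts].push(item));
  return chunks;
}

async function scrapeAll(browser, urls) {
  const chunks = chunkArray(urls, CONCURRENCY);

  // One tab per chunk; each tab works through its own chunk sequentially
  const results = await Promise.all(chunks.map(async (chunk) => {
    const page = await browser.newPage();
    const products = [];
    for (const url of chunk) {
      if (url.indexOf('url') >= 0) {
        // scrapeProduct is a hypothetical helper containing the
        // per-page steps from the question
        products.push(await scrapeProduct(page, url));
      }
    }
    await page.close();
    return products;
  }));

  // Merge the per-tab arrays into a single result array
  return results.flat();
}

// Usage: const resultArray = await scrapeAll(browser, productsToview);

Each tab still visits its own URLs one at a time, so the total time is roughly divided by the number of tabs, at the cost of extra CPU/memory and a higher chance of being rate limited by the site.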
Upvotes: 1