Reputation: 11
I have developed a crawler based on Actor + PuppeteerCrawler + Proxy and want to rescrape failed pages. To increase the chance that the rescrape succeeds, I want to switch to another proxyUrl. The idea is to create a new crawler with a modified launchPuppeteerFunction and a different proxyUrl, and to re-enqueue the failed pages. Please check the sample code below.
Unfortunately, it doesn't work, although I reset the request queue by dropping and reopening it. Is it possible to rescrape failed pages using PuppeteerCrawler with a different proxyUrl, and if so, how?
Best regards, Wolfgang
for (let retryCount = 0; retryCount <= MAX_RETRY_COUNT; retryCount++) {
    if (retryCount) {
        // Try to reset the request queue so that failed requests are rescraped
        await requestQueue.drop();
        requestQueue = await Apify.openRequestQueue(); // this is necessary to avoid exceptions
        // Re-enqueue the failed URLs from the failedUrls array
        // >>> ignored, although the queue was dropped and reopened!!!
        for (let failedUrl of failedUrls) {
            await requestQueue.addRequest({ url: failedUrl });
        }
    }
    crawlerOptions.launchPuppeteerFunction = () => {
        return Apify.launchPuppeteer({
            // generates a new proxy URL for this launchPuppeteer function
            proxyUrl: createProxyUrl()
        });
    };
    let crawler = new Apify.PuppeteerCrawler(crawlerOptions);
    await crawler.run();
}
Upvotes: 0
Views: 974
Reputation: 983
I think your approach should work, but on the other hand it should not be necessary. I'm not sure what createProxyUrl does.
You can supply a generic proxy URL with the auto username, which will use all your datacenter proxies at Apify. Or you can provide proxyUrls directly to PuppeteerCrawler.
Just don't forget that you have to switch browsers to get a new IP from the proxy. More in this article: https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler
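Since createProxyUrl is not shown in the question, here is a minimal sketch of what such a helper might look like if you keep your own list of proxies. The names createProxyRotator and nextProxyUrl, and the example.com proxy URLs, are assumptions for illustration only; they are not part of the Apify SDK.

```javascript
// Hypothetical helper (not an Apify API): returns a function that hands out
// the given proxy URLs in round-robin order, wrapping around at the end.
function createProxyRotator(proxyUrls) {
    if (!Array.isArray(proxyUrls) || proxyUrls.length === 0) {
        throw new Error('proxyUrls must be a non-empty array');
    }
    let index = 0;
    return function nextProxyUrl() {
        const url = proxyUrls[index % proxyUrls.length];
        index += 1;
        return url;
    };
}

// Placeholder proxy URLs, assumed for this example; replace with your own.
const nextProxyUrl = createProxyRotator([
    'http://user:pass@proxy-a.example.com:8000',
    'http://user:pass@proxy-b.example.com:8000',
]);
```

In the code from the question, nextProxyUrl() could stand in for createProxyUrl() inside launchPuppeteerFunction, so that each newly launched browser picks up the next proxy in the list. An already running browser keeps the proxy it was launched with, which is why the browser has to be switched.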
Upvotes: 1