Retry failed pages with new proxyUrl

I have developed a crawler based on Actor, PuppeteerCrawler and a proxy, and I want to rescrape failed pages. To increase the chance that the rescrape succeeds, I want to switch to another proxyUrl. The idea is to create a new crawler with a modified launchPuppeteerFunction and a different proxyUrl, and then re-enqueue the failed pages. Please check the sample code below.

Unfortunately, it doesn't work, although I reset the request queue by dropping and reopening it. Is it possible to rescrape failed pages using PuppeteerCrawler with a different proxyUrl, and if so, how?

Best regards, Wolfgang

for(let retryCount = 0; retryCount <= MAX_RETRY_COUNT; retryCount++){

    if(retryCount){
        // Try to reset the request queue, so that failed requests are rescraped
        await requestQueue.drop();
        requestQueue = await Apify.openRequestQueue();   // this is necessary to avoid exceptions
        // Re-enqueue the failed urls from the failedUrls array >>> they are ignored, despite drop() and reopening the request queue!!!
        for(let failedUrl of failedUrls){
            await requestQueue.addRequest({url: failedUrl});
        }
    }

    crawlerOptions.launchPuppeteerFunction = () => {
        return Apify.launchPuppeteer({
            // generates a new proxy url and adds it to a new launchPuppeteer function
            proxyUrl: createProxyUrl()
        });
    };

    let crawler = new Apify.PuppeteerCrawler(crawlerOptions);
    await crawler.run();

}

Upvotes: 0

Views: 974

Answers (1)

Lukáš Křivka

Reputation: 983

I think your approach should work, but on the other hand it should not be necessary. I'm not sure what createProxyUrl does.

You can supply a generic proxy URL with the auto username, which will use all of your datacenter proxies at Apify. Alternatively, you can provide proxyUrls directly to PuppeteerCrawler.
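
For illustration, a generic proxy URL with the auto username could be built roughly like this. It is only a sketch: the helper name buildApifyProxyUrl is hypothetical, and the host proxy.apify.com, the port 8000 and the APIFY_PROXY_PASSWORD environment variable are assumptions that should be checked against the proxy settings of your own Apify account.

```javascript
// Hypothetical helper (not part of the Apify SDK): builds a proxy URL
// that uses the "auto" username. Host and port are assumed defaults.
function buildApifyProxyUrl(password, host = 'proxy.apify.com', port = 8000) {
    if (!password) {
        throw new Error('A proxy password is required');
    }
    // Encode the password so special characters do not break the URL.
    return `http://auto:${encodeURIComponent(password)}@${host}:${port}`;
}

// Assumption: the password comes from an environment variable;
// 'my-password' is only a placeholder fallback for this sketch.
const proxyUrl = buildApifyProxyUrl(process.env.APIFY_PROXY_PASSWORD || 'my-password');
console.log(proxyUrl);
```

The resulting string could then be passed as proxyUrl to Apify.launchPuppeteer, in the same place where the question calls createProxyUrl().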

Just don't forget that you have to switch to a new browser to get a new IP from the proxy. More in this article - https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler
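
If you manage your own list of proxies, one way to hand a different proxy URL to every newly launched browser is a simple round-robin rotator. The sketch below is an assumption-laden illustration, not SDK functionality: createProxyRotator and the example.com proxy URLs are made up for this example.

```javascript
// Hypothetical helper: returns a function that yields the next proxy URL
// from the list on every call, wrapping around at the end.
function createProxyRotator(proxyUrls) {
    if (!Array.isArray(proxyUrls) || proxyUrls.length === 0) {
        throw new Error('At least one proxy URL is required');
    }
    let index = 0;
    return () => {
        const url = proxyUrls[index];
        index = (index + 1) % proxyUrls.length;
        return url;
    };
}

// Placeholder proxy URLs, for illustration only.
const nextProxyUrl = createProxyRotator([
    'http://proxy-a.example.com:8000',
    'http://proxy-b.example.com:8000',
]);

// In the question's setup, this would be called once per browser launch,
// e.g. inside launchPuppeteerFunction: Apify.launchPuppeteer({ proxyUrl: nextProxyUrl() })
console.log(nextProxyUrl()); // prints "http://proxy-a.example.com:8000"
console.log(nextProxyUrl()); // prints "http://proxy-b.example.com:8000"
```

Because the rotator is only consulted when a browser is launched, pages opened in an already running browser keep using that browser's proxy, which matches the point above about needing a new browser for a new IP.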

Upvotes: 1
