Simus

Reputation: 319

Temporizing User Agent rotation in Scrapy

I am writing a CrawlSpider using Scrapy, and I use a downloader middleware to rotate user agents for each request. I would like to know whether there is a way to time this rotation. In other words, is it possible to tell the spider to change the User-Agent every X seconds? I thought that the DOWNLOAD_DELAY setting might do the trick.

Upvotes: 2

Views: 1640

Answers (1)

alecxe

Reputation: 473903

You might approach it a bit differently. Since you control the crawling speed (requests per second) via CONCURRENT_REQUESTS, DOWNLOAD_DELAY and other relevant settings, you could simply count how many consecutive requests are sent with the same User-Agent header and rotate after a fixed number.

Something along these lines (based on scrapy-fake-useragent, not tested):

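from fake_useragent import UserAgent


class RotateUserAgentMiddleware(object):
    def __init__(self, settings):
        # Number of consecutive requests that share one User-Agent (configurable)
        self.rotate_user_agent_freq = settings.getint('ROTATE_USER_AGENT_FREQ')

        self.ua = UserAgent()

        self.request_count = 0
        self.current_user_agent = self.ua.random

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares through from_crawler; pass the settings on
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Pick a new User-Agent once the current one has been used
        # ROTATE_USER_AGENT_FREQ times
        if self.request_count >= self.rotate_user_agent_freq:
            self.current_user_agent = self.ua.random
            self.request_count = 0

        # Count every request, including the first one after a rotation
        self.request_count += 1
        request.headers.setdefault('User-Agent', self.current_user_agent)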
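
To pick a value for ROTATE_USER_AGENT_FREQ from a target interval in seconds, a rough estimate could look like the sketch below. It assumes requests to a single domain are spaced about DOWNLOAD_DELAY seconds apart and ignores response latency and the randomization Scrapy applies to the delay by default. The helper name is hypothetical, not part of Scrapy.

```python
import math


def requests_per_rotation(rotate_every_seconds, download_delay):
    """Estimate how many requests fit into one rotation interval.

    Hypothetical helper: assumes requests are spaced roughly
    `download_delay` seconds apart on a single download slot.
    """
    if download_delay <= 0:
        raise ValueError("download_delay must be positive for this estimate")
    # Round up and never go below one request per User-Agent
    return max(1, math.ceil(rotate_every_seconds / download_delay))


# Example for settings.py: with DOWNLOAD_DELAY = 0.5, rotating about
# every 30 seconds corresponds to roughly 60 requests.
ROTATE_USER_AGENT_FREQ = requests_per_rotation(30, 0.5)
```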

The request count might not be particularly accurate, since retries and other factors can throw it off, so please test it.
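If you really need the rotation tied to wall-clock time rather than to a request count, one option is to compare timestamps inside the middleware. The following is a minimal sketch, not tested against a live crawl. The class name and the USER_AGENT_LIST and ROTATE_USER_AGENT_SECONDS settings are assumptions made up for illustration, and it draws from a plain list of strings instead of fake_useragent.

```python
import random
import time


class TimedUserAgentMiddleware:
    """Rotate the User-Agent once a fixed number of seconds has elapsed."""

    def __init__(self, user_agents, interval, clock=time.monotonic):
        if not user_agents:
            raise ValueError("user_agents must not be empty")
        self.user_agents = list(user_agents)
        self.interval = interval
        self.clock = clock  # injectable so the timing can be tested
        self.current_user_agent = random.choice(self.user_agents)
        self.last_rotation = self.clock()

    @classmethod
    def from_crawler(cls, crawler):
        # Both setting names are hypothetical; define them in settings.py
        return cls(
            crawler.settings.getlist('USER_AGENT_LIST'),
            crawler.settings.getfloat('ROTATE_USER_AGENT_SECONDS', 60.0),
        )

    def process_request(self, request, spider):
        now = self.clock()
        if now - self.last_rotation >= self.interval:
            # random.choice may pick the same string again
            self.current_user_agent = random.choice(self.user_agents)
            self.last_rotation = now
        request.headers.setdefault('User-Agent', self.current_user_agent)
```

Note that the check only runs when a request passes through the middleware, so the switch happens on the first request after the interval has elapsed, not on a background timer. As with the counting version, the class still has to be enabled in DOWNLOADER_MIDDLEWARES.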

Upvotes: 3
