zwl1619

Reputation: 4232

Scrapy: How to write a UserAgentMiddleware?

I want to write a UserAgentMiddleware for Scrapy.
The docs say:

Middleware that allows spiders to override the default user agent. In order for a spider to override the default user agent, its user_agent attribute must be set.

docs: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.useragent

But there is no example, and I have no idea how to write one.
Any suggestions?

Upvotes: 1

Views: 1196

Answers (2)

eusid

Reputation: 769

First, visit a site that lists current user agent strings and save some of the newest ones to a file such as useragents.txt. Then, in your standard downloader middleware, do something like the following; this is the same place where you would set up your own proxy settings. Grab a random user agent from the list and put it in the request headers. Make sure random is imported at the top, and make sure useragents.txt gets closed when you are done with it; the simplest way is to load the file into a list once when the middleware is created, as below.

import random

from scrapy import signals


class GdataDownloaderMiddleware(object):
    def __init__(self):
        # Load the user agents once, rather than re-reading the file on
        # every request; the with-block closes the file automatically.
        with open('useragents.txt') as f:
            self.user_agents = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your middlewares.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware. Pick a random user agent for this request.
        user_agent = random.choice(self.user_agents)
        request.headers.setdefault(b'User-Agent', user_agent)

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
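
For the middleware to actually run, it also needs to be registered in your project's settings.py. A minimal sketch, assuming the project module is named gdata and the class lives in gdata/middlewares.py (adjust the dotted path and the priority number to your project):

# settings.py
# The dotted path below is an assumption; point it at your own
# middleware class. 543 is just an ordering priority.
DOWNLOADER_MIDDLEWARES = {
    'gdata.middlewares.GdataDownloaderMiddleware': 543,
}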

Upvotes: 0

Tarun Lalwani

Reputation: 146610

You can look at it in your installed Scrapy path:

/Users/tarun.lalwani/.virtualenvs/project/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py

"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
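
As the docstring says, with this built-in middleware a spider overrides the user agent simply by setting a user_agent attribute, which spider_opened() picks up. A minimal sketch (the spider name, UA string, and URL are made up for illustration):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Picked up by UserAgentMiddleware in spider_opened()
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) MyBot/1.0'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.logger.info('Fetched %s with the overridden UA', response.url)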

You can see the example below for setting a random user agent:

https://github.com/alecxe/scrapy-fake-useragent/blob/master/scrapy_fake_useragent/middleware.py
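
If you go with that package, enabling it typically means disabling the built-in UserAgentMiddleware and registering the random one in settings.py. A sketch based on the package's README (the priority value is illustrative):

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's default user agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the random user agent middleware from scrapy-fake-useragent
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}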

Upvotes: 3
