Reputation: 4232
I want to write a UserAgentMiddleware for Scrapy.
The docs say:
Middleware that allows spiders to override the default user agent. In order for a spider to override the default user agent, its user_agent attribute must be set.
But there is no example, and I have no idea how to write one.
Any suggestions?
Upvotes: 1
Views: 1196
Reputation: 769
First, collect some current user agent strings (plenty of websites list them) and save them one per line in a text file, here useragents.txt. Then, in your downloader middleware, grab a random UA from that list and put it in the request headers. This is the same place you would set up your own proxy settings. The example loads the file into a list once at the top of the module, so it is not reopened on every request.

import random

from scrapy import signals

# Load the user agents once at import time, skipping blank lines.
with open('useragents.txt', 'r') as f:
    USER_AGENTS = [line.strip() for line in f if line.strip()]


class GdataDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Assign directly: setdefault() would be a no-op if an earlier
        # middleware (such as the built-in UserAgentMiddleware) has
        # already set the header.
        request.headers[b'User-Agent'] = random.choice(USER_AGENTS)

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
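A custom middleware like the one above only runs once it is registered in the project settings. The fragment below is a minimal sketch of that step, not a definitive configuration: the module path myproject.middlewares is a hypothetical name, and the priority 400 is just one reasonable choice. Disabling the built-in UserAgentMiddleware is optional, but it avoids two middlewares competing over the same header.

```python
# settings.py (sketch). "myproject.middlewares" is a hypothetical module path;
# replace it with the module that actually contains the middleware class.
DOWNLOADER_MIDDLEWARES = {
    # Setting a middleware to None disables it.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Assumed priority; any integer works, lower values sit closer to the engine.
    "myproject.middlewares.GdataDownloaderMiddleware": 400,
}
```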
Upvotes: 0
Reputation: 146610
You can look at the built-in implementation in your Scrapy installation path, for example:
/Users/tarun.lalwani/.virtualenvs/project/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py
"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
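The key line in the built-in middleware is the getattr() call in spider_opened(): if the spider defines a user_agent attribute it wins, otherwise the value from the USER_AGENT setting is kept. The sketch below mirrors that lookup with plain stand-in classes so it runs without Scrapy installed. The class names, the resolve_user_agent helper, and the user agent string are all hypothetical; in a real project the spider would subclass scrapy.Spider and simply declare user_agent as a class attribute.

```python
# Stand-ins for spiders; a real one would subclass scrapy.Spider.
class QuotesSpider:
    name = "quotes"
    # Hypothetical value; setting this attribute is what overrides the default.
    user_agent = "Mozilla/5.0 (compatible; MyCrawler/1.0)"


class PlainSpider:
    name = "plain"  # no user_agent attribute, so the default applies


def resolve_user_agent(spider, default):
    """Same lookup as spider_opened(): prefer the spider's attribute."""
    return getattr(spider, "user_agent", default)


custom_ua = resolve_user_agent(QuotesSpider(), "Scrapy")
fallback_ua = resolve_user_agent(PlainSpider(), "Scrapy")
print(custom_ua)
print(fallback_ua)
```

So for a fixed per-spider user agent no custom middleware is needed; one is only required for behaviour such as rotating the value per request.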
For an example of setting a random user agent, see:
https://github.com/alecxe/scrapy-fake-useragent/blob/master/scrapy_fake_useragent/middleware.py
Upvotes: 3