Reputation: 201
I have a program that needs to scrape several different URLs using Scrapy, and I need it to use the same user agent and IP address for each URL. So if I am scraping, say, 50 URLs, each URL should get one unique user agent and IP address that are used only when scraping that URL, and the IP address and user agent should change when the program moves on to the next URL.
I have already got it to rotate user agents randomly, but now I need to pair user agents with different URLs and reuse the same user agent with the same URL each time. As for the IP addresses, I cannot even get them to rotate randomly, let alone pair each one with a unique URL.
SplashSpider.py
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import GameItem


class MySpider(Spider):
    name = 'splash_spider'  # name of the spider
    start_urls = [
        '',
        # ......
        # all the urls I need to scrape, 50+ will go in this list
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 3})

    # Scraping
    def parse(self, response):
        item = GameItem()
        for game in response.css(""):  # loop through the page contents until all needed info is scraped
            # Card name
            item["card name"] = game.css("").extract_first()  # selector corresponding to the card name
            # Price
            item["Price"] = game.css("td.deckdbbody.search_results_9::text").extract_first()  # selector corresponding to the price
            yield item
settings.py
SPIDER_MODULES = ['scrapy_javascript.spiders']
NEWSPIDER_MODULE = 'scrapy_javascript.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_javascript (http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Note: all downloader middlewares must go in a single dict; assigning
# DOWNLOADER_MIDDLEWARES more than once silently overwrites the earlier dicts.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # -------------------------------------------------------------------------
    # USER AGENT
    # -------------------------------------------------------------------------
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
USER_AGENTS = [
#Chrome
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
#Internet Explorer
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]
# -----------------------------------------------------------------------------
# IP ADDRESSES
# -----------------------------------------------------------------------------
PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES.update({
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
})
ROTATING_PROXY_LIST = [
    # note the trailing commas: without them Python silently concatenates
    # adjacent string literals into one giant (and useless) proxy URL
    'http://199.89.192.76:8050',
    'http://199.89.192.77:8050',
    'http://199.89.192.78:8050',
    'http://199.89.193.2:8050',
    'http://199.89.193.3:8050',
    'http://199.89.193.4:8050',
    'http://199.89.193.5:8050',
    'http://199.89.193.6:8050',
    'http://199.89.193.7:8050',
    'http://199.89.193.8:8050',
    'http://199.89.193.9:8050',
    'http://199.89.193.10:8050',
    'http://199.89.193.11:8050',
    'http://199.89.193.12:8050',
    'http://199.89.193.13:8050',
    'http://199.89.193.14:8050',
    'http://204.152.114.226:8050',
    'http://204.152.114.227:8050',
    'http://204.152.114.228:8050',
    'http://204.152.114.229:8050',
    'http://204.152.114.230:8050',
    'http://204.152.114.232:8050',
    'http://204.152.114.233:8050',
    'http://204.152.114.234:8050',
    'http://204.152.114.235:8050',
    'http://204.152.114.236:8050',
    'http://204.152.114.237:8050',
    'http://204.152.114.238:8050',
    'http://204.152.114.239:8050',
    'http://204.152.114.240:8050',
    'http://204.152.114.241:8050',
    'http://204.152.114.242:8050',
    'http://204.152.114.243:8050',
    'http://204.152.114.244:8050',
    'http://204.152.114.245:8050',
    'http://204.152.114.246:8050',
    'http://204.152.114.247:8050',
    'http://204.152.114.248:8050',
    'http://204.152.114.249:8050',
    'http://204.152.114.250:8050',
    'http://204.152.114.251:8050',
    'http://204.152.114.252:8050',
    'http://204.152.114.253:8050',
    'http://204.152.114.254:8050',
]
SPLASH_URL = 'http://199.89.192.74:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
In the end, it should simply pair each of the URLs I need to scrape with an IP address and user agent from the lists in my settings.py file.
Upvotes: 1
Views: 875
Reputation: 21436
This is a bit outside the scope of a simple Stack Overflow question.
However, the general approach for customizing the requests a Scrapy crawler sends out is to write a downloader middleware[1].
In your case, you want a downloader middleware that would:
1. Generate profiles on spider start by making a list of `(ip, user-agent)` tuples
2. Put these profiles into a round-robin (or alternative) queue
3. Attach one profile from the queue to every outgoing request
Briefly, as code it would look like this:
# middlewares.py
import random
from copy import copy

from scrapy import signals


class ProfileMiddleware:
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        mw = cls(*args, **kwargs)
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        mw.settings = crawler.settings
        return mw

    def spider_opened(self, spider, **kwargs):
        proxies = self.settings.getlist('PROXIES')
        user_agents = self.settings.getlist('USER_AGENTS')
        # pair up proxies and user agents into (proxy, user-agent) profiles
        self.profiles = list(zip(proxies, user_agents))
        self.queue = copy(self.profiles)
        random.shuffle(self.queue)

    def process_request(self, request, spider):
        # refill and reshuffle the queue once every profile has been used
        if not self.queue:
            self.queue = copy(self.profiles)
            random.shuffle(self.queue)
        profile = self.queue.pop()
        request.headers['User-Agent'] = profile[1]
        request.meta['proxy'] = profile[0]
I haven't tested this; it's just to illustrate the general idea.
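One caveat: the shuffled queue above hands out a different profile per request, but it does not guarantee that the *same* URL gets the *same* profile on every run, which is what the question asks for. A stable URL-to-profile pairing can be derived from a hash of the URL instead of a queue; the sketch below is untested against your setup, and `profile_for` is a hypothetical helper name:

```python
import hashlib


def profile_for(url, profiles):
    # md5 is stable across runs (Python's built-in hash() is randomized
    # per process), so the same URL always maps to the same profile
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return profiles[int(digest, 16) % len(profiles)]


profiles = [
    ("http://199.89.192.76:8050", "agent-a"),
    ("http://199.89.192.77:8050", "agent-b"),
    ("http://199.89.192.78:8050", "agent-c"),
]

# the same URL picks the same (proxy, user-agent) pair every time
assert profile_for("https://example.com/1", profiles) == \
       profile_for("https://example.com/1", profiles)
```

In `process_request` you would then use `profile = profile_for(request.url, self.profiles)` instead of popping from the shuffled queue.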
Then activate it, somewhere near the end of the middleware chain:
# settings.py
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.ProfileMiddleware': 900,
}
PROXIES = ['123', '456'...]
USER_AGENTS = ['firefox', 'chrome'...]
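One thing to keep in mind with this pairing (illustrated below with made-up values): `zip` stops at the shorter of the two lists, so if `PROXIES` and `USER_AGENTS` have different lengths, the extra entries are silently dropped and never used.

```python
proxies = ["http://10.0.0.1:8050", "http://10.0.0.2:8050", "http://10.0.0.3:8050"]
user_agents = ["agent-a", "agent-b"]

# zip pairs items positionally and stops at the shorter list
profiles = list(zip(proxies, user_agents))
print(profiles)
# [('http://10.0.0.1:8050', 'agent-a'), ('http://10.0.0.2:8050', 'agent-b')]
# -- the third proxy is silently unused
```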
1 - More on scrapy's downloader middlewares: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
Upvotes: 3