Reputation: 19
I created a spider that uses Scrapy, Splash and a proxy.
When I execute just one spider everything works fine. However, when I try to use CrawlerProcess, my spider doesn't use the proxy, which leads to a fast ban.
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from my_fake_useragent import UserAgent

ua = UserAgent()


class AdsSpiderSpider2(scrapy.Spider):
    name = 'ads_spider'
    start_urls = ['https://enqpothya3f4tgj.m.pipedream.net']

    script = '''function main(splash, args)
        splash:on_request(function(request)
            request:set_proxy{
                host = "pl.smartproxy.com",
                port = xxxx,
                username = xxxx,
                password = xxxx,
                type = "HTTP"
            }
        end)
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
        }
    end
    '''

    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield SplashRequest(
                url, self.parse,
                endpoint='execute',
                args={
                    'wait': 1,
                    'lua_source': self.script,
                    'js_source': 'document.body',
                    'proxy': 'http://[user:password]@pl.smartproxy.com:[xxxx]',
                },
                headers={
                    'User-Agent': ua.random(),
                    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                    'accept-language': 'pl,en;q=0.9,en-GB;q=0.8,en-US;q=0.7',
                },
            )

    def parse(self, response):
        print("x")
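The proxy argument passed to SplashRequest is a plain URL string. As a side note, a small helper can assemble it from the same credentials used in the Lua script, which avoids typos in the bracketed user:password@host:port format (build_proxy_url is a hypothetical name of mine, not part of scrapy-splash):

```python
def build_proxy_url(user, password, host, port, scheme="http"):
    # Assemble the URL string expected by Splash's 'proxy' argument,
    # e.g. "http://user:password@pl.smartproxy.com:7000"
    return f"{scheme}://{user}:{password}@{host}:{port}"

proxy = build_proxy_url("user", "password", "pl.smartproxy.com", 7000)
```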
I start it with:
scrapy crawl ads_spider
However, when I start it through CrawlerProcess instead, the spider doesn't use the proxy:
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(AdsSpiderSpider2)
process.start()
My settings.py:
BOT_NAME = 'xxxx'
SPIDER_MODULES = ['xxxxx']
NEWSPIDER_MODULE = 'xxxxx'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050'
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Why does using CrawlerProcess make my code not use the proxy?
Upvotes: 1
Views: 148
Reputation: 2110
You need to pass the settings object explicitly to the CrawlerProcess constructor, i.e. change
process = CrawlerProcess()
to
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(settings=get_project_settings())
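To illustrate why the bare constructor behaves differently, here is a minimal plain-Python sketch (no Scrapy required; the names DEFAULTS and make_process_settings are mine, for illustration only): a component that falls back to built-in defaults when no settings object is passed, mirroring how CrawlerProcess() with no arguments never reads your settings.py, so SPLASH_URL and the Splash/proxy middlewares are simply absent:

```python
# Built-in defaults, standing in for Scrapy's default settings.
DEFAULTS = {"SPLASH_URL": None, "DOWNLOADER_MIDDLEWARES": {}}

def make_process_settings(project_settings=None):
    # Merge project settings over the defaults; with no argument,
    # only the defaults apply (like a bare CrawlerProcess()).
    merged = dict(DEFAULTS)
    merged.update(project_settings or {})
    return merged

# Without project settings, SPLASH_URL stays unset:
make_process_settings()["SPLASH_URL"]  # None
# Passing them explicitly enables the Splash configuration:
make_process_settings({"SPLASH_URL": "http://localhost:8050"})["SPLASH_URL"]
```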
Upvotes: 1