Szymon

Reputation: 19

Scrapy CrawlerProcess doesn't use the proxy

I created a spider that uses Scrapy, Splash, and a proxy.

When I run just one spider everything works fine. However, when I try to use CrawlerProcess, my spider doesn't use the proxy, which quickly gets it banned.

Spider Code

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from my_fake_useragent import UserAgent
ua = UserAgent()


class AdsSpiderSpider2(scrapy.Spider):
    name = 'ads_spider'
    start_urls = ['https://enqpothya3f4tgj.m.pipedream.net']

    # Lua script for Splash's "execute" endpoint: set the proxy on every request
    script = '''function main(splash, args)
            splash:on_request(function(request)
                request:set_proxy{
                host = "pl.smartproxy.com",
                port = xxxx,
                username = xxxx,
                password = xxxx,
                type = "HTTP"
                }
            end
            )
            assert(splash:go(args.url))
            assert(splash:wait(0.5))
            
            return {
                html = splash:html(),
                png = splash:png(),
                har = splash:har(),
            }
            end
    '''

    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                args={
                    'wait': 1,
                    'lua_source': self.script,
                    'js_source': 'document.body',
                    'proxy': 'http://[user:password]@pl.smartproxy.com:[xxxx]',
                },
                headers={
                    'User-Agent': ua.random(),
                    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                    'accept-language': 'pl,en;q=0.9,en-GB;q=0.8,en-US;q=0.7',
                },
            )


    def parse(self, response):
        print("x")

Terminal

scrapy crawl ads_spider

CrawlerProcess

However, when I launch the same spider through CrawlerProcess, the proxy is not used and I get banned quickly:

from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(AdsSpiderSpider2)
process.start()

settings.py

BOT_NAME = 'xxxx'

SPIDER_MODULES = ['xxxxx']
NEWSPIDER_MODULE = 'xxxxx'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Why does using CrawlerProcess make my code not use the proxy?

Upvotes: 1

Views: 148

Answers (1)

msenior_

Reputation: 2110

You need to pass the settings object explicitly to the CrawlerProcess constructor. When you run scrapy crawl, Scrapy loads your project's settings.py automatically, but a bare CrawlerProcess() falls back to the default settings, so the scrapy_splash middlewares (and with them your proxy setup) are never enabled. Concretely:

  1. Add this import to the spider file: from scrapy.utils.project import get_project_settings
  2. Change the line process = CrawlerProcess() to process = CrawlerProcess(settings=get_project_settings())
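
Put together, the launcher from the question would become something like this (a minimal sketch; AdsSpiderSpider2 is the spider class defined above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() reads the project's settings.py, so the
# scrapy_splash middlewares, SPLASH_URL and the dupefilter are active
# exactly as they are under "scrapy crawl".
process = CrawlerProcess(settings=get_project_settings())
process.crawl(AdsSpiderSpider2)
process.start()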

Upvotes: 1
