goku

Reputation: 213

create a new request in downloader middleware

How can I create a new request with my proxy settings active?

middlewares.py

    from urllib.parse import urlencode
    from scrapy.http import Request

    def get_url(url):
        payload = {'api_key': 'my_api', 'url': url}
        proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
        return proxy_url

    class MyCustomDownloader:

        def process_request(self, request, spider):
            url = request.url
            return Request(url=get_url(url))

        def process_response(self, request, response, spider):
            return response

settings.py

    DOWNLOADER_MIDDLEWARES = {
        'usr157.middlewares.MyCustomDownloader': 543,
    }

spider.py

    import scrapy

    class testSpider(scrapy.Spider):
        name = 'test'
        start_urls = ['http://httpbin.org/ip']

        def parse(self, response):
            print(response.text)

When I run scrapy crawl test, it gets stuck and not a single request is made. Ideally, I want one request with my URL modified by the get_url function.

Upvotes: 2

Views: 426

Answers (2)

Roman

Reputation: 1943

As I see, you want to use the ScraperAPI proxy. This proxy has its own API to work with it. Here is an example from the official documentation page: https://www.scraperapi.com/documentation

    # remember to install the library: pip install scraperapi-sdk
    from scraper_api import ScraperAPIClient

    client = ScraperAPIClient('YOURAPIKEY')
    result = client.get(url='http://httpbin.org/ip').text
    print(result)

    # Scrapy users can simply replace the urls in their start_urls and parse function
    # Note for Scrapy, you should not use DOWNLOAD_DELAY and
    # RANDOMIZE_DOWNLOAD_DELAY, these will lower your concurrency and are not
    # needed with our API

    # ...other scrapy setup code
    start_urls = [client.scrapyGet(url='http://httpbin.org/ip')]

    def parse(self, response):
        # ...your parsing logic here
        yield scrapy.Request(client.scrapyGet(url='http://httpbin.org/ip'), self.parse)
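
For completeness, a minimal sketch of how those fragments fit into a whole spider (the class and spider names here are placeholders; scrapyGet is the SDK call shown above):

    import scrapy
    from scraper_api import ScraperAPIClient

    client = ScraperAPIClient('YOURAPIKEY')  # placeholder API key

    class ProxySpider(scrapy.Spider):
        name = 'proxy_test'
        # start URLs are wrapped by the SDK so they go through the proxy
        start_urls = [client.scrapyGet(url='http://httpbin.org/ip')]

        def parse(self, response):
            print(response.text)
            # follow-up requests must be wrapped the same way
            yield scrapy.Request(
                client.scrapyGet(url='http://httpbin.org/ip'),
                callback=self.parse,
            )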

To make it work in a middleware, you need to make simple changes like this:

    def process_request(self, request, spider):
        if 'api.scraperapi' not in request.url:
            new_url = 'http://api.scraperapi.com/?api_key=YOURAPIKEY&url=' + request.url
            # returning a replaced request re-schedules it through the middleware chain
            return request.replace(url=new_url)
        else:
            # the URL already points at the proxy, so let processing continue
            return None
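
Putting this together with the question's code, a sketch of the complete middleware (YOURAPIKEY is a placeholder; the target URL is percent-encoded with urlencode so it survives the query string):

    from urllib.parse import urlencode

    class MyCustomDownloader:

        def process_request(self, request, spider):
            # requests already pointing at the proxy are left alone; without
            # this guard the returned request would be rewritten again each
            # time it re-enters the middleware chain
            if 'api.scraperapi' in request.url:
                return None
            payload = {'api_key': 'YOURAPIKEY', 'url': request.url}
            return request.replace(url='http://api.scraperapi.com/?' + urlencode(payload))

        def process_response(self, request, response, spider):
            return response

Note that request.replace() keeps the original callback and meta, so the spider's parse method still runs on the response.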

PS: I don't have a ScraperAPI account to test this properly.

Upvotes: 1

Georgiy

Reputation: 3561

In your implementation, there is no callback on the request you generate.

In the source code of most Scrapy middlewares (1, 2, 3), the process_request method doesn't return a new request object; instead, it modifies parameters of the original request:

    def process_request(self, request, spider):
        # Request.url is read-only in Scrapy, so the URL is changed via the
        # internal setter instead of plain attribute assignment
        request._set_url(get_url(request.url))
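
As an aside, when a plain HTTP proxy is involved (rather than ScraperAPI's URL-wrapping endpoint), the usual in-place modification is through the request's meta dict, which Scrapy's built-in HttpProxyMiddleware reads. A minimal sketch, with a placeholder proxy address:

    class ProxyMiddleware:

        def process_request(self, request, spider):
            # HttpProxyMiddleware (enabled by default) routes the request
            # through this proxy; the address below is a placeholder
            request.meta['proxy'] = 'http://proxy.example:8000'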

Upvotes: 0
