Reputation: 213
How can I create a new request with my proxy settings active?
middlewares.py
from urllib.parse import urlencode
from scrapy.http import Request


def get_url(url):
    payload = {'api_key': 'my_api', 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url


class MyCustomDownloader:
    def process_request(self, request, spider):
        url = request.url
        return Request(url=get_url(url))

    def process_response(self, request, response, spider):
        return response
settings.py
DOWNLOADER_MIDDLEWARES = {
    'usr157.middlewares.MyCustomDownloader': 543,
}
spider.py
import scrapy


class testSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
When I run scrapy crawl test, it gets stuck and not a single request is made. Ideally, I want one request to be made with my URL modified by the get_url function.
Upvotes: 2
Views: 426
Reputation: 1943
As I see, you want to use the ScraperAPI proxy. This service has its own API to work with it. Here is an example from the official documentation page: https://www.scraperapi.com/documentation
# remember to install the library: pip install scraperapi-sdk
from scraper_api import ScraperAPIClient

client = ScraperAPIClient('YOURAPIKEY')
result = client.get(url='http://httpbin.org/ip').text
print(result)

# Scrapy users can simply replace the urls in their start_urls and parse function
# Note for Scrapy, you should not use DOWNLOAD_DELAY and
# RANDOMIZE_DOWNLOAD_DELAY, these will lower your concurrency and are not
# needed with our API

# ...other scrapy setup code
start_urls = [client.scrapyGet(url='http://httpbin.org/ip')]

def parse(self, response):
    # ...your parsing logic here
    yield scrapy.Request(client.scrapyGet(url='http://httpbin.org/ip'), self.parse)
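For context, the documentation snippet above might sit inside a complete spider roughly like this. This is a minimal sketch, assuming the scraperapi-sdk package is installed; YOURAPIKEY is a placeholder and IpSpider is a name chosen for illustration:

import scrapy
from scraper_api import ScraperAPIClient

client = ScraperAPIClient('YOURAPIKEY')  # placeholder API key


class IpSpider(scrapy.Spider):
    name = 'ip'
    # scrapyGet returns the proxied ScraperAPI URL, so no middleware is needed
    start_urls = [client.scrapyGet(url='http://httpbin.org/ip')]

    def parse(self, response):
        print(response.text)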
To make it work with a middleware instead, you need to make simple changes like this:
def process_request(self, request, spider):
    if 'api.scraperapi' not in request.url:
        new_url = 'http://api.scraperapi.com/?api_key=YOURAPIKEY&url=' + request.url
        request = request.replace(url=new_url)
        return request
    else:
        return None
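Returning a new Request from process_request makes Scrapy reschedule it through the middleware chain, so the 'api.scraperapi' check is what stops every request from being wrapped again and again (which is also why the original middleware from the question stalls). As a minimal sketch of the same idea with the key read from settings — ScraperAPIProxyMiddleware and the SCRAPERAPI_KEY setting are names chosen here for illustration, not part of ScraperAPI — it could look like:

from urllib.parse import urlencode


class ScraperAPIProxyMiddleware:
    def __init__(self, api_key):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        # SCRAPERAPI_KEY is a hypothetical setting name; define it in settings.py
        return cls(crawler.settings.get('SCRAPERAPI_KEY'))

    def process_request(self, request, spider):
        if 'api.scraperapi' in request.url:
            return None  # already wrapped; let it be downloaded
        payload = {'api_key': self.api_key, 'url': request.url}
        # The returned request re-enters the middleware chain; the guard
        # above prevents it from being wrapped a second time.
        return request.replace(url='http://api.scraperapi.com/?' + urlencode(payload))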
PS: I don't have a ScraperAPI account to test it properly.
Upvotes: 1
Reputation: 3561
In your implementation there is no callback in your generated request.
In most Scrapy middleware source code (1, 2, 3), the process_request method doesn't return a new request object; it modifies parameters of the original request:
def process_request(self, request, spider):
    request.url = get_url(request.url)
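One caveat: in current Scrapy versions, Request.url is a read-only property, so the assignment above raises an AttributeError; in-place modification works for mutable attributes such as request.meta or request.headers, while a URL change has to go through request.replace(). A minimal sketch of that variant, reusing the asker's get_url helper:

def process_request(self, request, spider):
    # request.replace() returns a new request; returning it from
    # process_request makes Scrapy schedule it instead of the original.
    if 'api.scraperapi' not in request.url:
        return request.replace(url=get_url(request.url))
    return None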
Upvotes: 0