Reputation: 17
I'm using scraperAPI.com to handle IP rotation for a scraping job I'm working on, and I'm trying to implement their new POST request method, but I keep receiving a 'HtmlResponse' object has no attribute 'dont_filter' error. Here is the custom start_requests function:
def start_requests(self):
    S_API_KEY = {'key': 'eifgvaiejfvbailefvbaiefvbialefgilabfva5465461654685312165465134654311'}
    url = "XXXXXXXXXXXXXX.com"
    payload = {}
    headers = {
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'Access-Control-Allow-Origin': '*',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'referer': 'XXXXXXXXXXX.com'
    }
    client = ScraperAPIClient(S_API_KEY['key'])
    resp = client.post(url=url, body=payload, headers=headers)
    yield HtmlResponse(resp.url, body=resp.text, encoding='utf-8')
The weird part is that when I execute this script piecewise in the Scrapy shell, it works fine and returns the proper data. Any insight into this issue would be GREATLY appreciated — I'm currently 4 hours into this problem.
Upvotes: 0
Views: 196
Reputation: 28206
The error you get is caused by yielding the wrong type: an HtmlResponse, where Scrapy expects Request objects.
From the docs for start_requests:

This method must return an iterable with the first Requests to crawl for this spider.
It seems the easiest solution would be to use a Scrapy request (probably a FormRequest) to the API url, instead of using ScraperAPIClient.post().
You should be able to use ScraperAPIClient.scrapyGet() to generate the correct url, but I have not tested this.
If you would prefer to keep using the official API library, a slightly more complicated option is writing your own downloader middleware.
Upvotes: 1