bharvey

Reputation: 17

HtmlResponse working in Scrapy Shell, but not in script?

I'm using scraperAPI.com to handle IP rotation for a scraping job I'm working on, and I'm trying to implement their new POST request method, but I keep receiving a `'HtmlResponse' object has no attribute 'dont_filter'` error. Here is my custom start_requests method:

def start_requests(self):
    S_API_KEY = {'key': 'eifgvaiejfvbailefvbaiefvbialefgilabfva5465461654685312165465134654311'}
    url = "XXXXXXXXXXXXXX.com"
    payload = {}
    headers = {
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'Access-Control-Allow-Origin': '*',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'referer': 'XXXXXXXXXXX.com'
    }
    client = ScraperAPIClient(S_API_KEY['key'])
    resp = client.post(url=url, body=payload, headers=headers)
    yield HtmlResponse(resp.url, body=resp.text, encoding='utf-8')

The weird part is that when I execute this script piecewise in scrapy shell it works fine and returns the proper data. Any insight into this issue would be GREATLY appreciated; I'm currently 4 hours into this problem.

Upvotes: 0

Views: 196

Answers (1)

stranac

Reputation: 28206

The error you get is caused by yielding the wrong type: start_requests must yield Request objects, but your code yields an HtmlResponse, which Scrapy's scheduler then tries to treat as a Request (hence the missing dont_filter attribute).
From the docs for start_requests:

This method must return an iterable with the first Requests to crawl for this spider.

It seems the easiest solution would be using a Scrapy request (probably a FormRequest) to the API url, instead of using ScraperAPIClient.post().
You should be able to use ScraperAPIClient.scrapyGet() to generate the correct url, but I have not tested this.
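To make that concrete, here is a hand-rolled sketch of the kind of URL scrapyGet() generates (the endpoint format follows ScraperAPI's documented proxy API; the function name and placeholder key below are illustrative, not part of the SDK):

```python
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPERAPI_KEY"  # placeholder, not a real key

def proxied_url(target_url, api_key=API_KEY):
    """Build a ScraperAPI proxy URL for target_url.

    This is roughly what ScraperAPIClient.scrapyGet() returns: the
    original URL is passed as a query parameter to the API endpoint.
    """
    query = urlencode({"api_key": api_key, "url": target_url})
    return "http://api.scraperapi.com/?" + query
```

The resulting URL can then be yielded from start_requests inside a regular scrapy.Request or FormRequest, so Scrapy's scheduler sees the type it expects.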

If you would prefer to continue using the official API library, a slightly more complicated option is writing your own downloader middleware.
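As an untested sketch of that middleware approach (the class name and the SCRAPERAPI_KEY setting are made up for illustration, and a real implementation would also need to handle POST bodies and any ScraperAPI options you use):

```python
from urllib.parse import urlencode

class ScraperAPIProxyMiddleware:
    """Hypothetical downloader middleware that rewrites every outgoing
    request URL so it is fetched through the ScraperAPI endpoint."""

    def __init__(self, api_key):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        # SCRAPERAPI_KEY is an invented setting name for this sketch.
        return cls(crawler.settings.get("SCRAPERAPI_KEY"))

    def process_request(self, request, spider):
        # Returning None tells Scrapy to continue processing unchanged.
        if "api.scraperapi.com" in request.url:
            return None  # already proxied
        proxied = "http://api.scraperapi.com/?" + urlencode(
            {"api_key": self.api_key, "url": request.url}
        )
        # request.replace() returns a copy of the Request with a new URL,
        # which Scrapy then schedules in place of the original.
        return request.replace(url=proxied)
```

Enabled via DOWNLOADER_MIDDLEWARES in settings.py, this keeps the spider itself yielding plain Requests, which is what start_requests is required to return.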

Upvotes: 1

Related Questions