dido

Reputation: 2361

How can I download json Response with Scrapy?

I am trying to download a page from the Newegg mobile API via Scrapy. I wrote this script, but it isn't working. With a normal link the script writes the response to a file, but with the URL of the Newegg mobile API it can't write the response to a file.

#spiders/newegg.py

from scrapy import Spider, Request

class NeweggSpider(Spider):
    name = 'newegg'
    allowed_domains = ['newegg.com']
    # http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails
    start_urls = ["http://www.newegg.com/Product/Product.aspx?Item=N82E16883282695"]

    meta_page = 'newegg_spider_page'
    meta_url_tpl = 'newegg_url_template'

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse_details)

    def parse_details(self, response):
        # response.body is bytes, so open the file in binary mode
        with open('log.txt', 'wb') as f:
            f.write(response.body)

I can't save the response from that URL.

I want to download the JSON from http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails

I am setting the USER_AGENT in scrapy.cfg:

[settings]
default = neweggs.settings

[deploy]
url = http://localhost:6800/
project = neweggs

USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'

Scrapy stats:

2015-10-28 14:46:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 777,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 1430,
 'downloader/response_count': 3,
 'downloader/response_status_count/400': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 28, 12, 46, 38, 776000),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 10, 28, 12, 46, 36, 208000)}
2015-10-28 14:46:38 [scrapy] INFO: Spider closed (finished)

Upvotes: 3

Views: 986

Answers (3)

eLRuLL

Reputation: 18799

You don't need to use scrapy.cfg to specify settings; that belongs in the settings.py file.

settings.py:

...
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
...

Upvotes: 1

Rejected

Reputation: 4491

The link to "http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails" is returning a page with HTTP status 400, which is "Bad Request".

This is why you're seeing 3 requests: the Scrapy retry middleware retries the page three times before giving up on it. By default, Scrapy will not pass responses with HTTP status 400 back to the spider. If you'd like it to, add handle_httpstatus_list = [400] to the spider.
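The filtering behavior can be sketched in simplified form (this mirrors the decision made by Scrapy's HttpError middleware, not its actual implementation):

```python
# Simplified sketch of how Scrapy decides whether a response reaches the
# spider callback: 2xx statuses always pass; anything else passes only
# if listed in the spider's handle_httpstatus_list.
def should_pass_to_spider(status, handle_httpstatus_list=()):
    return 200 <= status < 300 or status in handle_httpstatus_list

print(should_pass_to_spider(400))         # False: 400 is filtered out
print(should_pass_to_spider(400, [400]))  # True: explicitly allowed
```

So without handle_httpstatus_list = [400], your parse_details callback is never called for this URL, which is why nothing gets written to the file.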

Upvotes: 1

alecxe

Reputation: 473933

Since you are making the request manually in start_requests, you need to explicitly pass the User-Agent header with it. This works for me:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url,
            callback=self.parse_details,
            headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"},
        )
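Once the request goes through, the JSON body can be decoded in the callback with the standard library. A minimal sketch of the decoding step; the sample payload below is a stand-in, since the real ProductDetails response isn't reproduced in the question:

```python
import json

# Stand-in for response.body; the real Newegg ProductDetails payload
# is not shown in the question.
sample_body = b'{"Title": "Example GPU", "FinalPrice": 199.99}'

def decode_json_body(body):
    """Decode a JSON response body (bytes) into Python objects."""
    return json.loads(body.decode('utf-8'))

item = decode_json_body(sample_body)
print(item['Title'])  # Example GPU
```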

Upvotes: 1
