Reputation: 2361
I am trying to download a page from the Newegg mobile API with Scrapy. I wrote the script below, but it isn't working. With a normal product link the script writes the response to a file, but with the Newegg mobile API URL nothing is written.
from scrapy import Spider, Request


class NeweggSpider(Spider):
    name = 'newegg'
    allowed_domains = ['newegg.com']
    # http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails
    start_urls = ["http://www.newegg.com/Product/Product.aspx?Item=N82E16883282695"]
    meta_page = 'newegg_spider_page'
    meta_url_tpl = 'newegg_url_template'

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse_details)

    def parse_details(self, response):
        # response.body is bytes, so open the file in binary mode
        with open('log.txt', 'wb') as f:
            f.write(response.body)
I can't save the response from the ows.newegg.com URL.
I want to download the JSON from http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails
I am setting the USER_AGENT in scrapy.cfg:
[settings]
default = neweggs.settings
[deploy]
url = http://localhost:6800/
project = neweggs
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
Scrapy stats:
2015-10-28 14:46:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 777,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 1430,
'downloader/response_count': 3,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 28, 12, 46, 38, 776000),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 10, 28, 12, 46, 36, 208000)}
2015-10-28 14:46:38 [scrapy] INFO: Spider closed (finished)
Upvotes: 3
Views: 986
Reputation: 18799
You shouldn't use scrapy.cfg for specifying settings; you need to do that in the settings.py file.
settings.py:
...
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
...
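To see why the value in scrapy.cfg has no effect, it can help to look at that file as a plain INI file. The following is a minimal sketch using Python's standard configparser on the scrapy.cfg contents from the question. The variable names are made up for this illustration, and the comment about how the [deploy] section is used is my understanding rather than something taken from the question.

```python
import configparser

# scrapy.cfg contents as posted in the question
SCRAPY_CFG = """\
[settings]
default = neweggs.settings

[deploy]
url = http://localhost:6800/
project = neweggs
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
"""

parser = configparser.ConfigParser()
parser.read_string(SCRAPY_CFG)

# The [settings] section only names the module that holds the real settings.
settings_module = parser.get("settings", "default")

# USER_AGENT ended up as one more option in [deploy], next to url and project
# (configparser lowercases option names). As far as I know, that section holds
# deployment options and is not read as Scrapy settings.
deploy_keys = sorted(parser["deploy"].keys())

print(settings_module)
print(deploy_keys)
```

The only thing in the file that points at settings is the module path neweggs.settings, which is why USER_AGENT belongs in that module.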
Upvotes: 1
Reputation: 4491
The request to "http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails" returns a page with HTTP status 400, which is "Bad Request".
This is why you're seeing 3 requests: the Scrapy retry middleware attempts the page three times before giving up on it. By default, Scrapy will not pass responses with HTTP status 400 back to the spider. If you'd like it to, add handle_httpstatus_list = [400] to the spider.
Upvotes: 1
Reputation: 473933
Since you are creating the requests manually in start_requests, you can pass the User-Agent header explicitly with each one. This works for me:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse_details, headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})
Upvotes: 1