Reputation: 1587
I'm trying to get the price of a tool from the Castorama website, but so far I'm having trouble constructing the proper request.
http://www.castorama.pl/produkty/narzedzia-i-artykuly/elektronarzedzia-przenosne-i-akcesoria/szlifierki-i-polerki/szlifierki-oscylacyjne/szlifierka-oscylacyjna-pp-110w.html
Unfortunately this is not so easy: the price depends on the location of a shop, so before the price is shown you need to set your shop location. On the website I click 'ZOBACZ CENĘ' (a yellow box on the right), then fill in my zip code, e.g. '05-123', in the middle field and click the 'SZUKAJ PO KODZIE' button on the right. Finally I click the yellow 'USTAW' button in the pop-up box.
After that I get the desired price of the product. I would like to replicate this behaviour with Scrapy. To do so, I checked the Network tab (XHR filter) in Chrome developer tools to identify the request responsible for fetching the price, and I think the right one is 'getProductPriceStockByStore/'.
Request URL: http://www.castorama.pl/bold_all/data/getProductPriceStockByStore/
Request Method: POST
Status Code: 200 OK
Remote Address: 109.205.50.98:80
Request Headers
Accept:text/javascript, text/html, application/xml, text/xml, */*
Accept-Encoding:gzip, deflate
Accept-Language:en-GB,en;q=0.8,pl;q=0.6
Connection:keep-alive
Content-Length:39
Content-type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:selected_shop_flag=3; CACHED_FRONT_FORM_KEY=2MxQx5N1GeBOoDFl; localizationPopup=1; selected_shop=1; selected_shop_store_view=8002; bold_wishlist=3lg7qtm3teba7s1sbfg77hi352; frontend=3lg7qtm3teba7s1sbfg77hi352; VIEWED_PRODUCT_IDS=30052; cSID_VM=1460629378710; _ga=GA1.2.91284606.1460626559; _ceg.s=o5mcub; _ceg.u=o5mcub; _dc_gtm_UA-27193958-1=1
Host:www.castorama.pl
Origin:http://www.castorama.pl
Referer:http://www.castorama.pl/produkty/narzedzia-i-artykuly/elektronarzedzia-przenosne-i-akcesoria/szlifierki-i-polerki/szlifierki-oscylacyjne/szlifierka-oscylacyjna-pp-110w.html
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36
Form data:
isAjax:true
product_id:30052
store:8002
Response:
{"products":{"30052":{"price":"93.98","qty":"7.00","stock_status":1,"html":"in"}},"store":"8002","templates":{"in":"<span><span class=\"in-stock\">Dost\u0119pny<\/span><\/span>","out":"<span><span class=\"out-of-stock\">Niedost\u0119pny<\/span><\/span>","phone":"<span><span class=\"low-stock\">Na zam\u00f3wienie<\/span><\/span>","backorder":"<span><span class=\"backorder-stock\">Na zam\u00f3wienie<\/span><\/span>"},"status":true}
So I moved to Scrapy to implement a solution. I decided to create a POST request with the cookies attached and headers similar to those above:
import scrapy
from Castorama.items import CastoramaItem

class DmozSpider(scrapy.Spider):
    name = "Castorama"
    allowed_domains = ["castorama.pl"]
    start_urls = ["http://www.castorama.pl/bold_all/data/getProductPriceStockByStore/"]

    def start_Request(self):
        req = scrapy.Request(start_urls[0],
                             method='POST',
                             cookies={'selected_shop_flag': 3,
                                      'CACHED_FRONT_FORM_KEY': '2MxQx5N1GeBOoDFl',
                                      'selected_shop': 1,
                                      'selected_shop_flag': 3,
                                      'selected_shop_store_view': 8002,
                                      'VIEWED_PRODUCT_IDS': 30052,
                                      'frontend': '3lg7qtm3teba7s1sbfg77hi352',
                                      'cSID_VM': 1460626558358},
                             callback='Rozkoduj')
        yield req

    def Rozkoduj(self, response):
        print response.body
But I've had no success with this code. Here is my console log:
2016-04-14 12:54:09 [scrapy] INFO: Scrapy 1.0.5 started (bot: Castorama)
2016-04-14 12:54:09 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-04-14 12:54:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Castorama.spiders', 'SPIDER_MODULES': ['Castorama.spiders'], 'BOT_NAME': 'Castorama'}
2016-04-14 12:54:09 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-14 12:54:09 [boto] DEBUG: Retrieving credentials from metadata server.
2016-04-14 12:54:10 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/home/michal/anaconda2/lib/python2.7/site-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/home/michal/anaconda2/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/home/michal/anaconda2/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/home/michal/anaconda2/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/home/michal/anaconda2/lib/python2.7/urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/home/michal/anaconda2/lib/python2.7/urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-04-14 12:54:10 [boto] ERROR: Unable to read instance data, giving up
2016-04-14 12:54:10 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-14 12:54:10 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-14 12:54:10 [scrapy] INFO: Enabled item pipelines:
2016-04-14 12:54:10 [scrapy] INFO: Spider opened
2016-04-14 12:54:10 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-14 12:54:10 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-14 12:54:10 [scrapy] DEBUG: Crawled (200) <GET http://www.castorama.pl/bold_all/data/getProductPriceStockByStore/> (referer: None)
2016-04-14 12:54:10 [scrapy] ERROR: Spider error processing <GET http://www.castorama.pl/bold_all/data/getProductPriceStockByStore/> (referer: None)
Traceback (most recent call last):
File "/home/michal/anaconda2/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/michal/anaconda2/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 76, in parse
raise NotImplementedError
NotImplementedError
2016-04-14 12:54:10 [scrapy] INFO: Closing spider (finished)
2016-04-14 12:54:10 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 256,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 311,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 14, 10, 54, 10, 776463),
'log_count/DEBUG': 3,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2016, 4, 14, 10, 54, 10, 477689)}
2016-04-14 12:54:10 [scrapy] INFO: Spider closed (finished)
And here are my final questions. Is my approach correct? Should I attach the cookies to the request as in the code above, or should I do it a completely different way? And if I am going in the right direction, what should I change in my code to create the proper request?
Thank you in advance for any help.
Update: here is the spider after Pawel Miech's corrections. It is better because the request now works, but I still don't get the proper response.
import scrapy
from Castorama.items import CastoramaItem

class DmozSpider(scrapy.Spider):
    name = "Castorama"
    allowed_domains = ["castorama.pl"]
    start_urls = ['http://www.castorama.pl']

    def parse(self, response):
        start_urls = ["http://www.castorama.pl/bold_all/data/getProductPriceStockByStore/"]
        req = scrapy.Request(start_urls[0],
                             method='POST',
                             cookies={'selected_shop_flag': 3,
                                      'CACHED_FRONT_FORM_KEY': '2MxQx5N1GeBOoDFl',
                                      'selected_shop': 1,
                                      'selected_shop_flag': 3,
                                      'selected_shop_store_view': 8002,
                                      'VIEWED_PRODUCT_IDS': 30052,
                                      'frontend': '3lg7qtm3teba7s1sbfg77hi352',
                                      'cSID_VM': 1460626558358},
                             callback=self.rozkoduj)
        yield req

    def rozkoduj(self, response):
        print '>>>>>>>>>'
        print response.body
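One difference from the captured request above: this POST sends no form data (isAjax, product_id, store). A variant that includes the body via scrapy.FormRequest might look like this sketch (untested):

# FormRequest sends the fields as application/x-www-form-urlencoded,
# matching the Content-type of the captured request
req = scrapy.FormRequest(
    "http://www.castorama.pl/bold_all/data/getProductPriceStockByStore/",
    formdata={'isAjax': 'true', 'product_id': '30052', 'store': '8002'},
    callback=self.rozkoduj,
)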
Upvotes: 2
Views: 857
Reputation: 7822
Scrapy requests are asynchronous. Every request must have a callback; if none is given, the callback is set to the spider.parse method. If there is no spider.parse method, you get the NotImplementedError you are seeing in this stack trace:
Traceback (most recent call last):
File "/home/michal/anaconda2/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/michal/anaconda2/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 76, in parse
raise NotImplementedError
NotImplementedError
So start by adding a proper callback to your POST request (it must be a reference to a spider method, not a string, e.g. self.rozkoduj and not "Rozkoduj").
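For example, a minimal corrected version of the request from the first spider might look like this (a sketch only, cookies omitted; note, as an aside, that Scrapy's built-in hook is named start_requests, so the original start_Request method was never called, which is why the log shows a plain GET):

def start_requests(self):
    yield scrapy.Request(
        self.start_urls[0],
        method='POST',
        callback=self.rozkoduj,  # a method reference, not the string 'Rozkoduj'
    )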
The urlopen error comes from boto and is raised when you don't have S3 configured; it is ugly, but it can probably be ignored until someone fixes this ticket in Scrapy core.
And here are my final questions. Is my approach correct? Should I attach the cookies to the request as in the code above?
The answer, as usual, is: it depends. If you only care about sending the cookies for one request, your approach is correct. BUT if you want those cookies in ALL requests sent from the spider, including the requests issued from the callback of your POST, then you must add them to the cookiejar. Setting a cookie in the cookiejar is unfortunately not very easy; there is a ticket for making this simpler here: https://github.com/scrapy/scrapy/issues/1878
In a nutshell, setting a cookie in the cookiejar is a matter of doing something along the lines of the following (this is all pseudocode and pointers only):
from cookielib import Cookie
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

# must be a cookielib.Cookie object
# and must be given all the kwargs that cookielib.Cookie requires
cookie = Cookie(**kwargs)

# the cookiejar object is stored in the cookies middleware,
# so find that middleware among the downloader middlewares
all_mw = spider.crawler.engine.downloader.middleware.middlewares
cookie_middleware = [mw for mw in all_mw
                     if isinstance(mw, CookiesMiddleware)][0]
cookiejar = cookie_middleware._cookiejars[None]
cookiejar.set_cookie(cookie)
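For illustration, building one of the cookies from the question might look roughly like this; cookielib.Cookie takes a long list of required arguments, and the attribute values below are assumptions based on the captured request:

import cookielib

# hypothetical values, inferred from the captured request headers
cookie = cookielib.Cookie(
    version=0, name='selected_shop_store_view', value='8002',
    port=None, port_specified=False,
    domain='www.castorama.pl', domain_specified=True,
    domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None,
    rest={}, rfc2109=False,
)
cookiejar.set_cookie(cookie)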
Upvotes: 1