Reputation: 1317
In my parse method, I'd like to call 3 methods from a SpiderClass that I inherit from. First I'd like to parse the XPaths, then clean the data, then assign the data to an item instance and hand it over to the pipeline.
I'll try it with little code and just ask for the principles: cleanData and assignProductValues are never called. Why?
    def parse(self, response):
        for href in response.xpath("//a[@class='product--title']/@href"):
            url = href.extract()
            yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct)
            yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData)
            yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues)
I understand that I create a generator when using yield but I don't understand why the 2nd and 3rd yield are not being called after the first yield or how I can achieve them being called.
--
Then I tried another way: I don't want to do 3 requests towards the website - just one and work with the data.
    def parse(self, response):
        for href in response.xpath("//a[@class='product--title']/@href"):
            url = href.extract()
            item = MyItem()
            response = scrapy.Request(url, meta={'item': item}, callback=super(MyclassSpider, self).scrapeProduct)
            super(MyclassSpider, self).cleanData(response)
            super(MyclassSpider, self).assignProductValues(response)
            yield response
What happens here is that scrapeProduct is called, which might take a while (I've got a 5-second delay). But cleanData and assignProductValues are called right away, about 30 times (once for each iteration of the for loop).
How can I execute the three methods one after another with only 1 request to the website?
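For illustration, here is a plain-Python sketch of the flow being asked for. In Scrapy, scrapy.Request(...) only builds a Request object; its callback runs later, once the page has been downloaded, so any work on the data has to happen inside that callback. FakeRequest, run_engine, parse_product and the snake_case step functions are hypothetical stand-ins, not Scrapy APIs, and the "engine" here is synchronous, unlike the real one.

```python
# Plain-Python model, no Scrapy required. FakeRequest and run_engine are
# hypothetical stand-ins for scrapy.Request and the Scrapy engine.
class FakeRequest:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback  # stored now, invoked later by the engine


calls = []  # records the order in which the three steps run


def scrape_product(response):
    calls.append("scrape")
    return {"title": "  " + response["body"] + "  "}


def clean_data(data):
    calls.append("clean")
    return {"title": data["title"].strip()}


def assign_product_values(data):
    calls.append("assign")
    return dict(data)  # stand-in for filling an item instance


def parse_product(response):
    """Single callback: runs the three steps in order on one response."""
    raw = scrape_product(response)
    cleaned = clean_data(raw)
    yield assign_product_values(cleaned)


def parse(urls):
    # One request per product URL; all processing happens in its callback.
    for url in urls:
        yield FakeRequest(url, callback=parse_product)


def run_engine(requests):
    """Pretend to download each request, then hand the response to its callback."""
    items = []
    for request in requests:
        response = {"url": request.url, "body": "page for " + request.url}
        items.extend(request.callback(response))
    return items


items = run_engine(parse(["http://example.com/a", "http://example.com/b"]))
print(calls)
print(items)
```

In a real spider the equivalent would presumably be one yield scrapy.Request(url, callback=self.some_callback) per product, with that callback calling the three inherited methods in sequence on the response it receives.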
Upvotes: 0
Views: 128
Reputation: 10210
I guess that after you yield the first request, the other two are filtered out by the dupefilter, since all three requests go to the same URL. Check your log. If you don't want them to be filtered, pass dont_filter=True to the Request object.
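To show what I mean, here is a rough plain-Python model of the filtering. It is a simplification and not Scrapy's actual code: the real dupefilter compares request fingerprints rather than bare URLs, and FakeRequest, FakeScheduler and callbacks_reached are hypothetical names made up for this sketch.

```python
# Simplified model of duplicate filtering; not Scrapy's implementation.
class FakeRequest:
    def __init__(self, url, callback, dont_filter=False):
        self.url = url
        self.callback = callback
        self.dont_filter = dont_filter


class FakeScheduler:
    def __init__(self):
        self.seen = set()   # URLs that have already been scheduled
        self.queue = []     # requests that will actually be downloaded

    def enqueue(self, request):
        # Requests marked dont_filter bypass the duplicate check entirely.
        if not request.dont_filter:
            if request.url in self.seen:
                return False  # dropped as a duplicate
            self.seen.add(request.url)
        self.queue.append(request)
        return True


def callbacks_reached(dont_filter):
    """Schedule three requests to the same URL and list the surviving callbacks."""
    scheduler = FakeScheduler()
    for name in ("scrapeProduct", "cleanData", "assignProductValues"):
        scheduler.enqueue(
            FakeRequest("http://example.com/p/1", name, dont_filter=dont_filter)
        )
    return [request.callback for request in scheduler.queue]


filtered = callbacks_reached(dont_filter=False)
unfiltered = callbacks_reached(dont_filter=True)
print(filtered)
print(unfiltered)
```

Keep in mind that with dont_filter=True the same page is still downloaded three times, once per request, and each callback only receives the response, not the result of the previous callback.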
Upvotes: 1