Chris

Reputation: 1317

Scrapy / Python - executing several yields

In my parse method, I'd like to call 3 methods from a spider class that I inherit from. First I'd like to parse the XPaths, then clean the data, then assign the data to an item instance and hand it over to the pipeline.

I'll keep the code short and just ask about the principle: cleanData and assignProductValues are never called - why?

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url =  href.extract()

        yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues)

I understand that using yield creates a generator, but I don't understand why the 2nd and 3rd yields are not executed after the first one, or how I can get them to run.
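To illustrate what I mean: as far as I understand, a plain generator does run every yield once it is consumed, e.g.:

def numbers():
    yield 1
    yield 2
    yield 3

print(list(numbers()))  # prints [1, 2, 3] - every yield runs when the generator is iterated

So I would expect all three Requests to be produced as well.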

--

Then I tried another way: I don't want to make 3 requests to the website - just one, and then work with the data.

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url =  href.extract()

        item = MyItem()
        response = scrapy.Request(url, meta={'item': item}, callback=super(MyclassSpider, self).scrapeProduct)
        super(MyclassSpider, self).cleanData(response)
        super(MyclassSpider, self).assignProductValues(response)
        yield response

What happens here is that scrapeProduct is called, and that might take a while (I've got a 5-second delay). But cleanData and assignProductValues are called right away, about 30 times (once for each iteration of the for loop). How can I execute the three methods one after another with only one request to the website?
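What I have in mind is roughly this (just a sketch - handleProduct is a made-up glue method, and I'm assuming the inherited methods can take the response and work on the item stored in meta):

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        item = MyItem()
        # one request per product page; all processing happens in the callback
        yield scrapy.Request(url, meta={'item': item}, callback=self.handleProduct)

def handleProduct(self, response):
    # hypothetical glue method: run the three inherited steps on the same response
    base = super(MyclassSpider, self)
    base.scrapeProduct(response)        # assuming this fills response.meta['item']
    base.cleanData(response)
    base.assignProductValues(response)
    yield response.meta['item']         # hand the finished item over to the pipeline

Is something like that the right direction?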

Upvotes: 0

Views: 128

Answers (1)

Tomáš Linhart

Reputation: 10210

I guess that after you yield the first request, the other two are filtered out by the dupefilter. Check your log to confirm. If you don't want them to be filtered, pass dont_filter=True to the Request objects.
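For example, applied to your first snippet (each request will then actually be downloaded, even though the URL is the same):

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        # dont_filter=True tells the scheduler's dupefilter to let
        # repeated requests for the same URL through
        yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct, dont_filter=True)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData, dont_filter=True)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues, dont_filter=True)

Keep in mind that this means three separate downloads of the same page.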

Upvotes: 1
