Chris

Reputation: 1317

Scrapy / Python - executing several yields

In my parse method, I'd like to call 3 methods from a spider class that I inherit from. First I'd like to parse the XPaths, then clean the data, then assign the data to an item instance and hand it over to the pipeline.

I'll keep the code short and just ask about the principle: cleanData and assignProductValues are never called - why?

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url =  href.extract()

        yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues)

I understand that using yield creates a generator, but I don't understand why the 2nd and 3rd yields are not executed after the first one, or how I can get them to run.
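To illustrate what I mean: as far as I understand, a plain generator does run every yield once it is consumed, e.g.:

def numbers():
    yield 1
    yield 2
    yield 3

print(list(numbers()))  # prints [1, 2, 3] - every yield runs when the generator is iterated

So I would expect all three Requests to be produced as well.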

--

Then I tried another way: I don't want to make 3 requests to the website - just one, and then work with the data.

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url =  href.extract()

        item = MyItem()
        response = scrapy.Request(url, meta={'item': item}, callback=super(MyclassSpider, self).scrapeProduct)
        super(MyclassSpider, self).cleanData(response)
        super(MyclassSpider, self).assignProductValues(response)
        yield response

What happens here is that scrapeProduct is called, and that might take a while (I've got a 5-second delay). But cleanData and assignProductValues are called right away, about 30 times (once for each iteration of the for loop). How can I execute the three methods one after another with only one request to the website?
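What I have in mind is roughly this (just a sketch - handleProduct is a made-up glue method, and I'm assuming the inherited methods can take the response and work on the item stored in meta):

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        item = MyItem()
        # one request per product page; all processing happens in the callback
        yield scrapy.Request(url, meta={'item': item}, callback=self.handleProduct)

def handleProduct(self, response):
    # hypothetical glue method: run the three inherited steps on the same response
    base = super(MyclassSpider, self)
    base.scrapeProduct(response)        # assuming this fills response.meta['item']
    base.cleanData(response)
    base.assignProductValues(response)
    yield response.meta['item']         # hand the finished item over to the pipeline

Is something like that the right direction?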

Upvotes: 0

Views: 128

Answers (1)

Tomáš Linhart

Reputation: 10210

I guess that after you yield the first request, the other two are filtered out by the dupefilter. Check your log to confirm. If you don't want them to be filtered, pass dont_filter=True to the Request objects.
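For example, applied to your first snippet (each request will then actually be downloaded, even though the URL is the same):

def parse(self, response):
    for href in response.xpath("//a[@class='product--title']/@href"):
        url = href.extract()
        # dont_filter=True tells the scheduler's dupefilter to let
        # repeated requests for the same URL through
        yield scrapy.Request(url, callback=super(MyclassSpider, self).scrapeProduct, dont_filter=True)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).cleanData, dont_filter=True)
        yield scrapy.Request(url, callback=super(MyclassSpider, self).assignProductValues, dont_filter=True)

Keep in mind that this means three separate downloads of the same page.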

Upvotes: 1
