Sending data between two parse methods and getting KeyError SCRAPY

Question

I was trying to scrape this link.

https://www.thomasnet.com/suppliers

I want to pass the category names between two parse methods, but when the Scrapy crawler follows the next page, it raises a KeyError for categories_names.

categories_names = response.request.meta['categories_names']
KeyError: 'categories_names'

How do I get the same category's name while following the next page?

# -*- coding: utf-8 -*-
import scrapy

class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = ['https://www.thomasnet.com/suppliers']

    def parse(self, response):
        li = response.xpath('//div[@class="titled-list titled-list--covid-19-response-section titled-list--dropdown "]/ul/li/a')
        # li = response.xpath('//div[contains(@class, "titled-list--dropdown")]/ul/li/a')
        for each in li:
            categories_links = each.xpath('.//@href').get()
            categories = each.xpath('.//text()').get()

            yield response.follow(url=categories_links, callback=self.parse_li, meta={"categories_names": categories})


    def parse_li(self, response):
        categories_names = response.request.meta['categories_names']
        rows = response.xpath('//header[@class="profile-card__header"]/parent::div')
        for row in rows:
            links = row.xpath('.//header[@class="profile-card__header"]/h2/a/@href').get()
            company_type = row.xpath('.//span[@data-content="Company Type"]/text()[2]').get()
            yield {
                "Links": links,
                "Categories": categories_names,
                "Company Type": company_type if company_type else "N/A"
            }

        
        next_page = response.xpath('(//*[@class="icon"]/parent::a[@class="page-link"])[2]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse_li)

renatodvc · Accepted Answer

I've edited my answer since I misunderstood the issue before.

I believe the problem is that parse_li yields new requests recursively, but without setting the meta parameters again:

    next_page = response.xpath('(//*[@class="icon"]/parent::a[@class="page-link"])[2]/@href').get()
    if next_page:
        yield response.follow(url=next_page, callback=self.parse_li)

As far as I can tell, arbitrary data in meta is not propagated to subsequent requests, so you will need to reassign it:

    if next_page:
        yield response.follow(
            url=next_page,
            callback=self.parse_li,
            meta={"categories_names": categories_names}
        )

Consider taking a look at cb_kwargs in the future; it has been the recommended parameter for passing arbitrary data between requests since Scrapy v1.7 (see Request.cb_kwargs in the Scrapy documentation). It works slightly differently from meta, though: the values are passed to the callback as keyword arguments.
