codewithawais

Reputation: 579

Sending data between two parse methods and getting a KeyError in Scrapy

I was trying to scrape this link.

https://www.thomasnet.com/suppliers

I want to pass the category names between two parse methods, but when the Scrapy crawler follows the next page, it raises a KeyError for categories_names.

categories_names = response.request.meta['categories_names']
KeyError: 'categories_names'

How do I get the same category's name while following the next page?

# -*- coding: utf-8 -*-
import scrapy

class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = ['https://www.thomasnet.com/suppliers']

    def parse(self, response):
        li = response.xpath('//div[@class="titled-list titled-list--covid-19-response-section titled-list--dropdown "]/ul/li/a')
        # li = response.xpath('//div[contains(@class, "titled-list--dropdown")]/ul/li/a')
        for each in li:
            categories_links = each.xpath('.//@href').get()
            categories = each.xpath('.//text()').get()

            yield response.follow(url=categories_links, callback=self.parse_li, meta={"categories_names": categories})


    def parse_li(self, response):
        categories_names = response.request.meta['categories_names']
        rows = response.xpath('//header[@class="profile-card__header"]/parent::div')
        for row in rows:
            links = row.xpath('.//header[@class="profile-card__header"]/h2/a/@href').get()
            company_type = row.xpath('.//span[@data-content="Company Type"]/text()[2]').get()
            yield {
                "Links": links,
                "Categories": categories_names,
                "Company Type": company_type if company_type else "N/A"
            }

        
        next_page = response.xpath('(//*[@class="icon"]/parent::a[@class="page-link"])[2]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse_li)

Upvotes: 0

Views: 145

Answers (2)

You should access the meta attribute from the response object.

categories_names = response.meta['categories_names']

However, the currently recommended way to do this is to use cb_kwargs.

https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=cb#passing-additional-data-to-callback-functions

Upvotes: 0

renatodvc

Reputation: 2564

I've edited my answer since I misunderstood the issue before.

I believe the problem is that parse_li yields new requests recursively, but without passing the meta params again:

    next_page = response.xpath('(//*[@class="icon"]/parent::a[@class="page-link"])[2]/@href').get()
    if next_page:
        yield response.follow(url=next_page, callback=self.parse_li)

As far as I can tell, arbitrary data in meta is not propagated to subsequent requests, so you will need to reassign it:

        yield response.follow(
            url=next_page,
            callback=self.parse_li,
            meta={"categories_names": categories_names}
        )

Consider taking a look at cb_kwargs in the future. Since Scrapy v1.7 it is the recommended parameter for passing arbitrary data between requests; see the Scrapy documentation on passing additional data to callback functions. (It works slightly differently from meta, though.)

Upvotes: 2
