David Goldfarb

Reputation: 13

How to get data from a later function in scrapy

I'm having trouble structuring scrapy data the way I want. My spider gets some data from one page, then follows a list of links on that page to collect a link from each of those pages.

    def parse_page(self, response):
 
        links = response.css(LINK_SELECTOR).extract()

        data = {
            'name': response.css(NAME_SELECTOR).extract_first(),
            'date': response.css(DATE_SELECTOR).extract(),
        }

        for link in links:
            next_link = response.urljoin(link)
            yield scrapy.Request(next_link, callback=self.parse_url, meta={'data': data})

    def parse_url(self, response):
        data = response.meta['data']
        data['url'] = response.css('a::attr(href)').get()
        yield data

What I would like is to get the data with the following structure:

{'name': name, 'date': date, 'url': [url1, url2, url3, url4]}

Instead of

{'name': name, 'date': date, 'url': url1}
{'name': name, 'date': date, 'url': url2}
{'name': name, 'date': date, 'url': url3}
{'name': name, 'date': date, 'url': url4}

I've tried using items, but I can't work out how to pass the data from parse_url back to parse_page. How would I do that?

Thanks in advance.

Upvotes: 1

Views: 182

Answers (2)

SIM

Reputation: 22440

The following is one way you can achieve that. The inline_requests library will help you get the expected output.

import scrapy
from scrapy.crawler import CrawlerProcess
from inline_requests import inline_requests

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/san-francisco-ca/mip/honey-honey-cafe-crepery-4752771"]

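    # inline_requests lets this callback yield a Request and receive its Response inline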
    @inline_requests
    def parse(self, response):
        data = {
            'name':response.css(".sales-info > h1::text").get(),
            'phone':response.css(".contact > p.phone::text").get(),
            'target_link':[]
        }
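        # follow each reviewer link and pull the business link from the landing page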
        for item_link in response.css(".review-info > a.author[href]::attr(href)").getall():
            resp = yield scrapy.Request(response.urljoin(item_link), meta={'handle_httpstatus_all': True})
            target_link = resp.css("a.review-business-name::attr(href)").get()
            data['target_link'].append(target_link)

        print(data)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT':'Mozilla/5.0',
        'LOG_LEVEL':'ERROR',
    })
    c.crawl(YellowpagesSpider)
    c.start()

Output it produces:

{'name': 'Honey Honey Cafe & Crepery', 'phone': '(415) 351-2423', 'target_link': ['/san-francisco-ca/mip/honey-honey-cafe-crepery-4752771', '/walnut-ca/mip/akasaka-japanese-cuisine-455476824', '/san-francisco-ca/mip/honey-honey-cafe-crepery-4752771']}
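Note that inline_requests is a third-party package (published on PyPI as scrapy-inline-requests), and the decorated callback should be a plain generator, not an async def coroutine.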

Upvotes: 0

stranac

Reputation: 28256

You can use Scrapy's coroutine support to do this pretty easily.

The code would look something like this:

async def parse_page(self, response):
    links = response.css(LINK_SELECTOR).extract()
    data = {
        'name': response.css(NAME_SELECTOR).extract_first(),
        'date': response.css(DATE_SELECTOR).extract(),
        'url': [],
    }
    for link in links:
        request = response.follow(link)
        resp = await self.crawler.engine.download(request, self)
        data['url'].append(resp.css('a::attr(href)').get())
    yield data
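Here engine.download() returns a twisted Deferred, which Scrapy lets you await directly inside an async def callback (coroutine support has been available since Scrapy 2.0), so all of the urls are collected before the single combined item is yielded. The body above reuses the placeholder selectors from the question, and binds each downloaded page to a new name (resp) so it doesn't shadow the outer response used by response.follow().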

Upvotes: 2
