Scraping data from current as well as nested links at the same time using Scrapy

Question

I am fairly new to scraping pages with Scrapy. While trying to scrape quotes along with details of each author from the author's linked page, I ran into a problem.

import scrapy

class QuotesProject(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        item = {}
        for x in response.css('.quote'):
            item['quote'] = x.css('.text::text').get()
            item['author'] = x.css('.author::text').get()
            item['href'] = response.urljoin(x.css('a::attr(href)').get())

            yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})

    def parse_inside(self, response):
        item = response.meta['item']
        item['aauthor'] = response.css('h3::text').get()
        return item

The desired output for each quote is as follows, where author and aauthor should have the same value (but aauthor is fetched from another page):

{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin'}

However, I'm getting quite unexpected output:

2019-04-04 15:45:52 [scrapy.core.engine] INFO: Spider opened
2019-04-04 15:45:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-04 15:45:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (404)  (referer: None)
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: None)
2019-04-04 15:45:53 [scrapy.dupefilters] DEBUG: Filtered duplicate request:  - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-04-04 15:45:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to  from 
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Andre-Gide/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'André Gide
    '}
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200)  (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/J-K-Rowling/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'J.K. Rowling
    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Jane-Austen/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Eleanor Roosevelt
    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Eleanor-Roosevelt/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Marilyn Monroe
    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin
    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Marilyn-Monroe/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin
    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Steve-Martin/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin
    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Thomas-A-Edison/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Thomas A. Edison
    '}

It seems to complete all iterations of the parse() method and use the last item dictionary for later links. But if that's the case, all the aauthor values should have been the same. I searched a lot for a solution, but everything was beyond what I could understand at this point. Also, the requests seem to be asynchronous.

I would appreciate it if someone could explain the problem along with a working solution.

vezunchik · Accepted Answer

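Your code is fine; just move the item creation inside the loop. Otherwise every request carries the same dict object in meta, so all of them see the values written by the last iteration, and each callback only overwrites aauthor on that one shared dict (which is why aauthor is the only field that varies in your output):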

def parse(self, response):
    for x in response.css('.quote'):
        item = {}  # new dict for every quote, so requests no longer share one object
        item['quote'] = x.css('.text::text').get()
        item['author'] = x.css('.author::text').get()
        item['href'] = response.urljoin(x.css('a::attr(href)').get())
        yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})
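
The effect can be reproduced without Scrapy. The following is only a minimal sketch: `build_shared` and `build_fresh` are hypothetical stand-ins for the two versions of parse(), and a plain dict stands in for scrapy.Request. It assumes, as the log in the question suggests, that the loop in parse() has finished before the first callback runs, which `list()` imitates here.

```python
def build_shared(rows):
    """Mimics the question's parse(): one dict created before the loop."""
    item = {}
    for row in rows:
        item['author'] = row
        # Stand-in for scrapy.Request; every "request" references the same dict.
        yield {'meta': {'item': item}}


def build_fresh(rows):
    """Mimics the corrected parse(): a new dict on every iteration."""
    for row in rows:
        item = {}
        item['author'] = row
        yield {'meta': {'item': item}}


authors = ['Albert Einstein', 'Jane Austen', 'Steve Martin']

# Exhaust both generators before looking at any item, as an
# approximation of the loop finishing before the callbacks start.
shared = list(build_shared(authors))
fresh = list(build_fresh(authors))

print([r['meta']['item']['author'] for r in shared])
# ['Steve Martin', 'Steve Martin', 'Steve Martin']
print([r['meta']['item']['author'] for r in fresh])
# ['Albert Einstein', 'Jane Austen', 'Steve Martin']
```

Two side notes on the log in the question. The "Filtered duplicate request" line means Scrapy's duplicate filter drops repeated requests to the same author URL, so with default settings only one quote per author would reach parse_inside; passing dont_filter=True to scrapy.Request is one way around that if every quote is needed. The trailing newline and spaces in aauthor can be removed by calling .strip() on the extracted string.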
