dkhost07
dkhost07

Reputation: 322

Scraping data from current as well as nested links at the same time using Scrapy

I am fairly new to scraping pages using Scrapy. While trying to scrape quotes along with details of each author from the respective links for them I encountered a problem.

import scrapy

class QuotesProject(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        item = {}
        for x in response.css('.quote'):
            item['quote'] = x.css('.text::text').get()
            item['author'] = x.css('.author::text').get()
            item['href'] = response.urljoin(x.css('a::attr(href)').get())

            yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})

    def parse_inside(self, response):
        item = response.meta['item']
        item['aauthor'] = response.css('h3::text').get()
        return item

The desired output for each quote is as follows, where author and aauthor should have the same value(but aauthor is fetched from another page):

{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin'}

However I'm getting quite unexpected output

2019-04-04 15:45:52 [scrapy.core.engine] INFO: Spider opened
2019-04-04 15:45:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-04 15:45:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2019-04-04 15:45:53 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-04-04 15:45:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Andre-Gide/> from <GET http://quotes.toscrape.com/author/Andre-Gide>
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Albert-Einstein/> from <GET http://quotes.toscrape.com/author/Albert-Einstein>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> from <GET http://quotes.toscrape.com/author/Marilyn-Monroe>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/J-K-Rowling/> from <GET http://quotes.toscrape.com/author/J-K-Rowling>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> from <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Steve-Martin/> from <GET http://quotes.toscrape.com/author/Steve-Martin>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Jane-Austen/> from <GET http://quotes.toscrape.com/author/Jane-Austen>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> from <GET http://quotes.toscrape.com/author/Thomas-A-Edison>
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Andre-Gide/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'André Gide\n    '}
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/J-K-Rowling/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'J.K. Rowling\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Jane-Austen/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Eleanor Roosevelt\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Eleanor-Roosevelt/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Marilyn Monroe\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Marilyn-Monroe/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Steve-Martin/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Thomas-A-Edison/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Thomas A. Edison\n    '}

It seems to complete all iterations of the parse() method and use the last item dictionary for later links. But if that's the case, all the aauthor values should have been the same. I searched a lot for the solution, but everything was beyond what I could understand at this point.Also, the requests seem to be asynchronous.

Would appreciate if someone explains the problem along with a working solution

Upvotes: 1

Views: 74

Answers (1)

vezunchik
vezunchik

Reputation: 3717

Your code is good, just move item creation to cycle, otherwise this is the same object with the same data:

def parse(self, response):
    for x in response.css('.quote'):
        item = {}
        item['quote'] = x.css('.text::text').get()
        item['author'] = x.css('.author::text').get()
        item['href'] = response.urljoin(x.css('a::attr(href)').get())
        yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})

Upvotes: 1

Related Questions