Reputation: 59
As the question title implies, I'm having trouble with the web-scraping library Scrapy. It's only returning the first quote off each page of the Quotes to Scrape site.
I know this may seem simple to those who have mastered Scrapy, but I'm struggling with the concept used here. If someone could fix the error and explain the process, that would be great.
This is my current code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        title = response.xpath('//div/h1/a/text()').extract_first()
        author = response.xpath(
            '//div[@class="quote"]/span/small/text()').extract_first()
        author_url = response.xpath(
            '//div[@class="quote"]/span/a/@href').extract_first()
        final_author_url = self.base_url + author_url.replace('../..', '')
        quote = response.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()').extract_first()
        yield {
            'Title': title,
            'Author': author,
            'URL': final_author_url,
            'Quote': quote,
        }
Currently I'm trying something based off this approach. I've seen others do something similar, but I'm failing to pull off the same result.
def parse_filter_book(self, response):
    for quote in response.css('div.mw-parser-output > div'):
        title = quote.xpath('//div/h1/a/text()').extract_first()
        author = quote.xpath(
            '//div[@class="quote"]/span/small/text()').extract_first()
        author_url = quote.xpath(
            '//div[@class="quote"]/span/a/@href').extract_first()
        final_author_url = self.base_url + author_url.replace('../..', '')
        quotes = quote.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()').extract_first()
The current output is just 10 items, one from each of the 10 pages. The new modified version produces no output, just an error.
My goal is to scrape only the 10 pages of the site, which is why the rules are set up the way they are.
----- Update -----
Wow, thanks. I copy-pasted the corrected function and am getting the desired output. I'm going through the explanation and comparing my old code to this new one right now, so I'll answer properly in a while.
Upvotes: 0
Views: 1094
Reputation: 2183
The issue is with your quote selector, which is returning an empty list:

response.css('div.mw-parser-output > div')

Therefore you never enter the for loop.
To make sure you are getting all the quotes, you could simply put them all into a variable and print it, verifying that you are getting what you need.
I also updated the XPaths in your spider, as they were extracting data from the whole page and not from the quote selector. Make sure to prepend . to your XPath when you already have a local selector object.
Example:
This will get the first author in your quote selector:
quote.xpath('.//span/small/text()').extract_first()
This will get you the first author on the webpage:
quote.xpath('//div[@class = "quote"]/span/small/text()').extract_first()
Working spider:
class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            # I'm not sure where this title is coming from in the quote
            # title = quote.xpath('.//div/h1/a/text()').extract_first()
            author = quote.xpath(
                './/span/small/text()').extract_first()
            author_url = quote.xpath(
                './/span/a/@href').extract_first()
            final_author_url = self.base_url + author_url.replace('../..', '')
            text = quote.xpath(
                './/span[@class="text"]/text()').extract_first()
            yield {
                'Author': author,
                'URL': final_author_url,
                'Quote': text,
            }
Upvotes: 1
Reputation: 2564
Your first code sample receives a response and extracts only one item, since there is no loop and the selectors use extract_first():
def parse_filter_book(self, response):
    title = response.xpath('//div/h1/a/text()').extract_first()
    ...
    yield {
        'Title': title,
        ...
    }
This literally tells the spider to find all elements in the response that match the XPath //div/h1/a/text(), take the first matched item with extract_first(), and set that value in the title variable.
It does the same for all the other variables, yields the result, and finishes its execution.
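This behavior can be reproduced without Scrapy at all; a minimal plain-Python sketch (the quote list here is made up for illustration) of why the loop matters in a generator-style callback:

```python
def parse_once(items):
    # No loop: yields exactly one item, then the callback returns.
    # This mirrors the first spider, which produced one item per page.
    yield items[0]

def parse_all(items):
    # With a loop: one yield per match, i.e. one item per quote.
    for item in items:
        yield item

quotes = ['q1', 'q2', 'q3']
print(len(list(parse_once(quotes))))  # -> 1
print(len(list(parse_all(quotes))))   # -> 3
```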
The general idea in the second code is right: you select all elements that are a quote, iterate over them, and extract the values in each iteration. There are a few issues, though.
This will return empty:

response.css('div.mw-parser-output > div')

I don't see any div element with that class on the page. Replacing it with response.css('div.quote') is enough to select the quote elements.
However, we still need to fix your extraction paths. In this loop, quote is already a div[@class="quote"] element, so you should drop that part of the path, since you want to look inside the selector.
for quote in response.css('div.quote'):
    title = quote.xpath('//div/h1/a/text()').get()
    author = quote.xpath('span/small/text()').get()
    author_url = quote.xpath('span/a/@href').get()
    final_author_url = response.urljoin(author_url)
    quotes = quote.xpath('span[@class="text"]/text()').get()
    yield {
        'Title': title,
        'Author': author,
        'URL': final_author_url,
        'Quote': quotes,  # I believe you meant quotes not quote; quote is the selector, quotes the text.
    }
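As an aside, Scrapy's response.urljoin(href) is equivalent to urllib.parse.urljoin(response.url, href), so its behavior can be checked with plain Python (the page URL and href below are illustrative values from this site):

```python
from urllib.parse import urljoin

# response.urljoin(href) in a spider joins the current page's URL
# with the (possibly relative) href you scraped.
page_url = 'http://quotes.toscrape.com/page/2/'
href = '/author/Albert-Einstein'
print(urljoin(page_url, href))
# -> http://quotes.toscrape.com/author/Albert-Einstein
```

This is why the manual self.base_url + author_url.replace('../..', '') dance is unnecessary.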
- I left title untouched; it will always scrape the same thing, the title of the page. I wasn't sure if that was the intention.
- I used the .get() method instead of .extract_first(). Since Scrapy 1.5.2 they are the same thing, but .get() allows for easier comprehension.
- I used the response.urljoin() method to join the response's URL with the relative URL you scraped. Quite handy.

Upvotes: 1