Reputation: 59
As the question title implies, I'm having trouble with the web-scraping library Scrapy. It's only returning the first quote off each page of the Quotes to Scrape site.
I know this may seem simple to those who have mastered Scrapy, but I'm struggling with the concept used here. If someone could fix the error and explain the process, that would be great.
This is my current code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        title = response.xpath('//div/h1/a/text()').extract_first()
        author = response.xpath(
            '//div[@class="quote"]/span/small/text()').extract_first()
        author_url = response.xpath(
            '//div[@class="quote"]/span/a/@href').extract_first()
        final_author_url = self.base_url + author_url.replace('../..', '')
        quote = response.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()').extract_first()
        yield {
            'Title': title,
            'Author': author,
            'URL': final_author_url,
            'Quote': quote,
        }
Currently I'm trying something based off this approach. I've seen others do something similar, but I'm failing to pull off the same result.
def parse_filter_book(self, response):
    for quote in response.css('div.mw-parser-output > div'):
        title = quote.xpath('//div/h1/a/text()').extract_first()
        author = quote.xpath(
            '//div[@class="quote"]/span/small/text()').extract_first()
        author_url = quote.xpath(
            '//div[@class="quote"]/span/a/@href').extract_first()
        final_author_url = self.base_url + author_url.replace('../..', '')
        quotes = quote.xpath(
            '//div[@class="quote"]/span[@class="text"]/text()').extract_first()
The current output is just 10 items, one from each of the 10 pages. The new modified version produces no output, just an error.
My goal is to scrape only the 10 pages of the site, which is why the rules are set up the way they are.
----- Update -----
Wow, thanks. I copy-pasted the corrected function and am getting the desired output. I'm going through the explanation and comparing my old code to this new one right now, so I'll answer properly in a while.
Upvotes: 0
Views: 1094
Reputation: 2183
The issue is with your quote selector, which is returning an empty list:

response.css('div.mw-parser-output > div')

Therefore you never enter the for loop.
To make sure you are getting all the quotes, you could simply put them all into a variable and print it, verifying that you are getting what you need.
I also updated the XPaths in your spider, as they were extracting data from the whole page and not from the quote selector. Make sure to prepend . to your XPath when you already have a local selector object.
Example:
This will get the first author in your quote selector:
quote.xpath('.//span/small/text()').extract_first()
This will get you the first author on the webpage:
quote.xpath('//div[@class = "quote"]/span/small/text()').extract_first()
Working spider:
class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            # I'm not sure where this title is coming from in the quote
            # title = quote.xpath('.//div/h1/a/text()').extract_first()
            author = quote.xpath(
                './/span/small/text()').extract_first()
            author_url = quote.xpath(
                './/span/a/@href').extract_first()
            final_author_url = self.base_url + author_url.replace('../..', '')
            text = quote.xpath(
                './/span[@class="text"]/text()').extract_first()
            yield {
                'Author': author,
                'URL': final_author_url,
                'Quote': text,
            }
Upvotes: 1
Reputation: 2564
Your first code sample receives a response and extracts only one item, since there is no loop and the selectors use extract_first():
def parse_filter_book(self, response):
    title = response.xpath('//div/h1/a/text()').extract_first()
    ...
    yield {
        'Title': title,
        ...
    }
This literally tells the spider to find all elements in the response that match the XPath //div/h1/a/text(), take the first matched item with extract_first(), and set that value in the title variable.
It does the same for all the other variables, yields the result, and finishes its execution.
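This behavior can be reproduced without Scrapy at all; a minimal plain-Python sketch (the quote list here is made up for illustration) of why the loop matters in a generator-style callback:

```python
def parse_once(items):
    # No loop: yields exactly one item, then the callback returns.
    # This mirrors the first spider, which produced one item per page.
    yield items[0]

def parse_all(items):
    # With a loop: one yield per match, i.e. one item per quote.
    for item in items:
        yield item

quotes = ['q1', 'q2', 'q3']
print(len(list(parse_once(quotes))))  # -> 1
print(len(list(parse_all(quotes))))   # -> 3
```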
The general idea in the second code is right: you select all elements that are a quote, iterate over them, and extract the values in each iteration. There are a few issues, though.
This will return empty:

response.css('div.mw-parser-output > div')

I don't see any div element with that class on the page. Replacing it with response.css('div.quote') is enough to select the quote elements.
However, we still need to fix your extraction paths. In this loop, quote is already a div[@class="quote"] element, so you should drop that part of the path, since you want to look inside the selector.
for quote in response.css('div.quote'):
    title = quote.xpath('//div/h1/a/text()').get()
    author = quote.xpath('span/small/text()').get()
    author_url = quote.xpath('span/a/@href').get()
    final_author_url = response.urljoin(author_url)
    quotes = quote.xpath('span[@class="text"]/text()').get()
    yield {
        'Title': title,
        'Author': author,
        'URL': final_author_url,
        'Quote': quotes,  # I believe you meant quotes not quote; quote is the selector, quotes the text.
    }
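As an aside, Scrapy's response.urljoin(href) is equivalent to urllib.parse.urljoin(response.url, href), so its behavior can be checked with plain Python (the page URL and href below are illustrative values from this site):

```python
from urllib.parse import urljoin

# response.urljoin(href) in a spider joins the current page's URL
# with the (possibly relative) href you scraped.
page_url = 'http://quotes.toscrape.com/page/2/'
href = '/author/Albert-Einstein'
print(urljoin(page_url, href))
# -> http://quotes.toscrape.com/author/Albert-Einstein
```

This is why the manual self.base_url + author_url.replace('../..', '') dance is unnecessary.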
- I left title untouched; it will always scrape the same thing, the title of the page. I wasn't sure if that was the intention.
- I used the .get() method instead of .extract_first(). Since Scrapy 1.5.2 they are the same thing, but .get() allows for easier comprehension.
- I used the response.urljoin() method to join the response's URL with the relative URL you scraped. Quite handy.

Upvotes: 1