Reputation: 99
I'm following the official Scrapy tutorial, where I'm supposed to scrape data from http://quotes.toscrape.com. The tutorial shows how to scrape the data with the following spider:
import scrapy


class QuotesSpiderCss(scrapy.Spider):
    name = "quotes_css"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags::text').extract()
            }
Crawling with this spider and exporting to a JSON file returns what's expected:
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n "]},
...]
I'm trying to write the same spider using XPath instead of CSS selectors:
class QuotesSpiderXpath(scrapy.Spider):
    name = 'quotes_xpath'
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'text': quote.xpath("//span[@class='text']/text()").extract_first(),
                'author': quote.xpath("//small[@class='author']/text()").extract_first(),
                'tags': quote.xpath("//div[@class='tags']/text()").extract()
            }
But this spider returns a list in which every item is the same quote:
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n Tags:\n ", " \n \n ", "\n \n ", "\n \n ", "\n \n "]},
...]
Thanks in advance!
Upvotes: 1
Views: 1514
Reputation: 2594
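You always get the same quote because you're not using relative XPaths. An expression that starts with // searches the whole document, even when it is called on a nested selector such as quote, so extract_first() returns the first match on the page every time and extract() returns the tags of every quote. See the Scrapy documentation on working with relative XPaths.
Prefix your XPath expressions with a dot so they are evaluated relative to the current quote, as in the following parse method: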
def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        yield {
            'text': quote.xpath(".//span[@class='text']/text()").extract_first(),
            'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
            'tags': quote.xpath(".//div[@class='tags']/text()").extract()
        }
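To see the scoping effect without running a crawl, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The PAGE markup is a hypothetical, simplified version of the quotes page, and ElementTree is not what Scrapy uses internally. Searching from root inside the loop plays the role of a // expression in Scrapy, while searching from quote plays the role of the .// form.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the quotes page.
PAGE = """
<html><body>
  <div class="quote">
    <span class="text">First quote</span>
    <small class="author">Author A</small>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <small class="author">Author B</small>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)
quotes = root.findall(".//div[@class='quote']")

# Searching from the document root ignores the current quote, so every
# iteration picks up the first match in the whole page.
unscoped = [root.find(".//span[@class='text']").text for _ in quotes]

# Searching from the current element stays inside that quote.
scoped = [quote.find(".//span[@class='text']").text for quote in quotes]

print(unscoped)  # ['First quote', 'First quote']
print(scoped)    # ['First quote', 'Second quote']
```

The unscoped list repeats the first quote, which mirrors the duplicated output in the question. The scoped list contains one distinct quote per element.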
Upvotes: 3