What is the difference between response.xpath and response.css?

Question

I tried to learn response.xpath and response.css using the site: http://quotes.toscrape.com/

scrapy shell 'http://quotes.toscrape.com'
for quote in response.css("div.quote"):
    title = quote.css("span.text::text").extract()

This gets only one value per quote. But if I use XPath:

scrapy shell 'http://quotes.toscrape.com'
for quote in response.css("div.quote"):
    title = quote.xpath('//*[@class="text"]/text()').extract()

It gets a list of all the titles on the whole page.

Can someone tell me what the difference is between the two tools? For some elements I prefer response.xpath, for example specific table content, which is easy to get with following-sibling but which I cannot get with response.css.

gmolau · Accepted Answer

For a general explanation of the difference between XPath and CSS see the Scrapy docs:

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

XPath offers more features than pure CSS selection (the Wikipedia article gives a nice overview), at the cost of being harder to learn. Scrapy converts CSS selectors to XPath internally, so the .css() function is basically syntactic sugar for .xpath() and you can use whichever one you feel more comfortable with.

Regarding your specific examples, I think the problem is that your XPath query is not actually relative to the previous selector (the quote div), but absolute to the whole document. See this quote from "Working with relative XPaths" in the Scrapy docs:

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.

To get the same result as with your CSS selector you could use something like this, where the XPath query is relative to the quote div:

for quote in response.css('div.quote'):
    print(quote.xpath('span[@class="text"]/text()').extract())

Note that XPath also has the . expression, which makes a query relative to the current node. Using './/*[@class="text"]/text()' searches only the descendants of the quote div, so it also gives the result you want.
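To see the relative versus whole-document difference without a network connection, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy selectors. That is an assumption made for portability, and ElementTree supports only a subset of XPath. The PAGE markup and the variable names are hypothetical and much simpler than the real quotes.toscrape.com page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the markup on quotes.toscrape.com.
PAGE = (
    "<html><body>"
    '<div class="quote"><span class="text">Quote one</span></div>'
    '<div class="quote"><span class="text">Quote two</span></div>'
    "</body></html>"
)

root = ET.fromstring(PAGE)
quotes = root.findall(".//div[@class='quote']")

# Relative to each quote: only a span that is a direct child of that div.
per_quote = [[s.text for s in q.findall("span[@class='text']")] for q in quotes]

# Relative with ".//": any descendant of the current quote, at any depth.
descendants = [[s.text for s in q.findall(".//*[@class='text']")] for q in quotes]

# Searched from the document root: every matching node on the page. This is
# the scope a leading "//" has in Scrapy, even on a nested selector.
whole_page = [s.text for s in root.findall(".//*[@class='text']")]

print(per_quote)
print(descendants)
print(whole_page)
```

The two relative queries yield one text per quote, while the root-level query returns both texts at once, which mirrors what the question observed with '//*[@class="text"]/text()'.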
