Reputation: 13
I am working through the Scrapy tutorial and trying to scrape the text, author and tags of each quote on the companion website. When I use CSS selectors as shown in the tutorial:
for quote in response.css('div.quote'):
    print quote.css('span.text::text').extract()
    print quote.css('span small::text').extract()
    print quote.css('div.tags a.tag::text').extract()
I get the desired result (i.e., each quote's text, author and tags printed once). But when I use XPath selectors like this:
for quote in response.xpath("//*[@class='quote']"):
    print quote.xpath("//*[@class='text']/text()").extract()
    print quote.xpath("//*[@class='author']/text()").extract()
    print quote.xpath("//*[@class='tag']/text()").extract()
I get duplicate results!
I still can't figure out why the two behave so differently.
Upvotes: 1
Views: 1049
Reputation: 1155
When you use // the expression is matched against the whole response. If you use .// its scope is limited to the current selector. Try .// instead of //. That should solve your problem :-)
Upvotes: 1
Reputation: 86
Try .// instead of // for your relative searches, e.g.
print quote.xpath(".//*[@class='text']/text()").extract()
When you use //, although you're searching from quote, it takes this to mean an absolute search, so its context is still the root of the document. .// however, means to search from . (the current node), so the context of this search is limited to the elements nested under quote.
As a side note, if you're looking to get the exact same results, you should consider changing * to the tags you used in the CSS search (span or div). In this case it doesn't make any difference, but it's a heads-up for future reference.
Upvotes: 4