Reputation: 3815
In the documentation and in SO answers, I can only find examples of how to exclude CSS classes, using this syntax:
response.css("div[id='content']:not([class*='infobox'])")
What I want to achieve, however, is to exclude a node, or even multiple nodes, such as the <span> and <div> elements that are inside an <li> element.
Let me give you an example. Let's say I am scraping this HTML:
<li class="classA">
    <div class="classB">
        ..
    </div>
    <span class="classC">Whatever</span>
    This is the string I want to scrape
</li>
I am only interested in scraping the text "This is the string I want to scrape", so I want to skip both the <div> and <span> nodes. I tried the following in the scrapy shell, to no avail:
response.css(".classA:not(span|div)::text").extract()
However, I am still getting the excluded nodes.
Upvotes: 0
Views: 2148
Reputation: 108
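You do not need :not() here. The ::text pseudo-element in CSS and /text() in XPath select only the text nodes that are direct children of the <li>, so the text inside the <div> and <span> is skipped:
response.css('li.classA::text').extract_first()
response.xpath('//li[@class = "classA"]/text()').extract_first()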
Upvotes: 2