Reputation: 2702
I am using Python Scrapy to scrape some data from a website.
The website content looks something like this:
<html>
<div class="details">
<div class="a"> not needed</div>
content 1
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
<div class="b"> this is also not needed</div>
</div>
</html>
I need to get the full HTML, excluding the divs with class a and b,
so my output will look like this:
<div class="details">
content 1
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
</div>
How can I write a correct XPath for that? Or should I write XPaths for the divs with class 'details', 'a', and 'b' and use string operations to remove the divs with class 'a' and 'b'?
Note that here "content 1" is text directly inside the div with class 'details' and is not wrapped in a child element.
Upvotes: 1
Views: 740
Reputation: 474161
You can get all children except the div with class a or b using node() and self:: syntax:
//div[@class="details"]/node()[not(self::div[@class="a" or @class="b"])]
Demo using scrapy shell:
$ scrapy shell index.html
>>> nodes = response.xpath('//div[@class="details"]/node()[not(self::div[@class="a" or @class="b"])]').extract()
>>> print(''.join(nodes))
content 1
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
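The question's desired output also keeps the outer div with class details, which the XPath above does not return, since it selects only the child nodes. A minimal sketch of one way to put the wrapper back is shown here. The names KEEP_XPATH, wrap_details, and details_html are hypothetical helpers, not part of Scrapy, and the wrapper tag is hard-coded on the assumption that the div carries no attributes other than class="details".

```python
# Hypothetical helpers (not part of Scrapy): rebuild the outer div around
# the child nodes that the XPath keeps.

# Same expression as in the answer: every child node of the details div
# except div elements whose class is exactly "a" or "b".
KEEP_XPATH = (
    '//div[@class="details"]'
    '/node()[not(self::div[@class="a" or @class="b"])]'
)


def wrap_details(nodes):
    """Join the kept child nodes and restore the outer wrapper div.

    Assumption: the wrapper has no attributes besides class="details",
    so it can be written out literally.
    """
    return '<div class="details">' + ''.join(nodes) + '</div>'


def details_html(response):
    """Return the details div without its class a/b children.

    `response` is any object exposing .xpath(...).extract(), such as the
    `response` available inside scrapy shell.
    """
    nodes = response.xpath(KEEP_XPATH).extract()
    return wrap_details(nodes)
```

In scrapy shell, print(details_html(response)) would then print the kept nodes enclosed in the details div. Text nodes are returned as-is, so the whitespace and newlines between elements in the source page are preserved in the result.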
Upvotes: 4