Reputation: 239
I am currently writing a web crawler with Scrapy, and I would like to fetch all the text displayed on the screen of every HTML document with a single XPath query.
Here is the HTML I'm working with:
<body>
  <div>
    <h1>Main title</h1>
    <div>
      <script>var grandson;</script>
      <p>Paragraph</p>
    </div>
  </div>
  <script>var child;</script>
</body>
As you can see, there are some script tags that I want to filter when getting the text inside the body tag.
Here is my first XPath query and its result:
XPath: /body/*//text()
Result: Main title / var grandson; / Paragraph / var child;
This is not good because it also fetches the text inside the script tag.
Here is my second try:
XPath: /body/*[not(self::script)]//text()
Result: Main title / var grandson; / Paragraph
Here, the last script tag (which is body's child) is filtered, but the inner script is not.
How would you filter all the script tags? Thanks in advance.
Upvotes: 2
Views: 1797
Reputation: 852
This XPath does what you want:
.//text()[not(parent::script)]
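It looks at the parent of each text node and keeps only the nodes whose parent is not a script element.
A more general version, which can be run against any element that contains HTML, also skips text nested anywhere inside style and noscript elements: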
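To check the parent-based filter against the markup from the question, here is a minimal sketch. It assumes lxml is installed and uses it directly instead of a full Scrapy project; the names SAMPLE and text_outside_scripts are made up for illustration. Inside a spider, the same expression could be passed to response.xpath(...).getall().

```python
from lxml import html

# The markup from the question.
SAMPLE = """
<body>
  <div>
    <h1>Main title</h1>
    <div>
      <script>var grandson;</script>
      <p>Paragraph</p>
    </div>
  </div>
  <script>var child;</script>
</body>
"""


def text_outside_scripts(markup):
    """Return stripped, non-empty text nodes whose parent is not a script element."""
    root = html.document_fromstring(markup)
    nodes = root.xpath(".//text()[not(parent::script)]")
    # Drop the whitespace-only nodes that come from indentation.
    return [node.strip() for node in nodes if node.strip()]


print(text_outside_scripts(SAMPLE))  # ['Main title', 'Paragraph']
```

The predicate sits on the text node itself, so the depth of the script element no longer matters. That is the difference from /body/*[not(self::script)]//text(), which only tests the direct children of body.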
.//text()[not(ancestor::script|ancestor::style|ancestor::noscript)]
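The ancestor axis matters when a non-rendered element contains child elements, because the text then sits one or more levels below it. The sketch below compares both predicates; it assumes lxml is installed, and the sample markup and the helper name extract are invented for this illustration.

```python
from lxml import html

# Hypothetical markup: the noscript text is wrapped in a <p>, so its parent is <p>.
MARKUP = """
<body>
  <style>p { color: red; }</style>
  <h1>Main title</h1>
  <noscript><p>Please enable JavaScript</p></noscript>
  <script>var child;</script>
</body>
"""

PARENT_ONLY = ".//text()[not(parent::script)]"
ANCESTORS = ".//text()[not(ancestor::script|ancestor::style|ancestor::noscript)]"


def extract(markup, expression):
    """Run an XPath expression and return the stripped, non-empty text nodes."""
    root = html.document_fromstring(markup)
    return [node.strip() for node in root.xpath(expression) if node.strip()]


parent_result = extract(MARKUP, PARENT_ONLY)
ancestor_result = extract(MARKUP, ANCESTORS)

print(parent_result)    # still contains the style rule and the noscript message
print(ancestor_result)  # ['Main title']
```

With the parent-only predicate, the CSS rule and the noscript message both come through; the ancestor-based predicate removes them regardless of how deeply the text is nested.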
Upvotes: 1