b1onic

Reputation: 239

Select every text node in an HTML document except script nodes with XPath

I am currently writing a web crawler with Scrapy, and I would like to fetch all the text displayed on the screen of every HTML document with a single XPath query.

Here is the HTML I'm working with:

<body>
  <div>
    <h1>Main title</h1>
    <div>
      <script>var grandson;</script>
      <p>Paragraph</p>
    </div>
  </div>
  <script>var child;</script>
</body>

As you can see, there are some script tags that I want to filter out when getting the text inside the body tag.

Here is my first XPath query and its result:

XPath: /body/*//text()
Result: Main title / var grandson; / Paragraph / var child;

This is not good because it also fetches the text inside the script tag.

Here is my second try:

XPath: /body/*[not(self::script)]//text()
Result: Main title / var grandson; / Paragraph

Here, the last script tag (which is body's child) is filtered, but the inner script is not.

How would you filter out all the script tags? Thanks in advance.

Upvotes: 2

Views: 1797

Answers (2)

Lance

Reputation: 852

This XPath does what you want:

.//text()[not(parent::script)]

It looks at the parent of each text node and drops the node when that parent is a script element.

Here is a more general example. It can be applied to any element that contains HTML, and it also skips text nested anywhere inside style and noscript elements:

.//text()[not(ancestor::script|ancestor::style|ancestor::noscript)]
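The difference between the parent check and the ancestor check can be sketched in plain Python, without an XPath engine. This is only an illustration: the sample markup, the SKIPPED set and the helper names (own_text, text_by_parent, text_by_ancestor) are made up for this example, and the standard-library parser only works here because the sample happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample (not the HTML from the question). It is well-formed
# XML, so the standard-library parser can read it.
SAMPLE = """<body>
  <h1>Main title</h1>
  <script>var child;</script>
  <noscript><p>Please enable JavaScript</p></noscript>
  <style>p { color: red; }</style>
</body>"""

SKIPPED = {"script", "style", "noscript"}


def own_text(element):
    """Yield the non-blank text nodes whose parent is `element`."""
    # ElementTree keeps an element's child text nodes in its .text and in
    # the .tail of each of its child elements.
    for piece in [element.text] + [child.tail for child in element]:
        if piece and piece.strip():
            yield piece.strip()


def text_by_parent(root):
    """Mimic text()[not(parent::...)]: only the direct parent is tested."""
    return [t for el in root.iter() if el.tag not in SKIPPED for t in own_text(el)]


def text_by_ancestor(element):
    """Mimic text()[not(ancestor::...)]: a skipped element hides its whole subtree."""
    if element.tag in SKIPPED:
        return []
    # Text is grouped per parent here, which is not always strict document order.
    found = list(own_text(element))
    for child in element:
        found.extend(text_by_ancestor(child))
    return found


root = ET.fromstring(SAMPLE)
print(text_by_parent(root))    # the <p> inside <noscript> still leaks through
print(text_by_ancestor(root))  # everything under <noscript> is dropped
```

In a Scrapy callback there is no need for any of this: the XPath expression itself can be passed to response.xpath(...) and the matching strings read with .getall().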

Upvotes: 1

Michael Kay

Reputation: 163625

Try

//*[not(self::script)]/text()
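One way to check this expression against the HTML from the question is a short script with lxml, the library that Scrapy's selectors are built on. This is a rough sketch that assumes lxml is installed. The non_blank helper is not part of any library; it is added here only to drop the whitespace-only text nodes between tags.

```python
from lxml import html

# The HTML from the question.
SAMPLE = """<body>
  <div>
    <h1>Main title</h1>
    <div>
      <script>var grandson;</script>
      <p>Paragraph</p>
    </div>
  </div>
  <script>var child;</script>
</body>"""

# document_fromstring always returns the <html> root, adding it if missing.
doc = html.document_fromstring(SAMPLE)


def non_blank(nodes):
    """Strip each text node and drop the ones that are only whitespace."""
    return [str(node).strip() for node in nodes if node.strip()]


# Every text node, script content included.
all_text = non_blank(doc.xpath("//text()"))

# Text nodes that are children of any element other than <script>.
by_self = non_blank(doc.xpath("//*[not(self::script)]/text()"))

# The same selection written from the text node's point of view.
by_parent = non_blank(doc.xpath("//text()[not(parent::script)]"))

print(all_text)
print(by_self)
print(by_parent)
```

Both filtered queries should select the same nodes, since a text node is a child of a non-script element exactly when its parent is not a script element.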

Upvotes: 4
