Select every text node in an HTML document except script nodes with XPath

Question

I am currently writing a web crawler with Scrapy, and I would like to fetch all the text displayed on the screen of every HTML document with a single XPath query.

Here is the HTML I'm working with:


  
    Main title
    
      
      Paragraph

As you can see, there are some script tags that I want to filter out when getting the text inside the body tag.

Here is my first XPath query and its result:

XPath: /body/*//text()
Result: Main title / var grandson; / Paragraph / var child;

This is not good because it also fetches the text inside the script tag.

Here is my second try:

XPath: /body/*[not(self::script)]//text()
Result: Main title / var grandson; / Paragraph

Here, the last script tag (which is body's child) is filtered, but the inner script is not.

How would you filter out all the script tags? Thanks in advance.

Michael Kay · Accepted Answer

Try this, which tests the parent element of each text node rather than only the children of body:

//*[not(self::script)]/text()
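
To see how the three expressions differ, here is a minimal sketch that runs them with lxml, the library underneath Scrapy's selectors. The sample markup is an assumption: the tag names (h1, div, p) are a guess that reproduces the results reported in the question, not the asker's exact HTML. The helper visible_text is hypothetical, and the queries start with //body rather than /body so that they match when the document root is html.

```python
from lxml import html

# Assumed sample markup: tag names are a guess that matches the
# results reported in the question, not the asker's original HTML.
SAMPLE = """
<html>
  <body>
    <h1>Main title</h1>
    <div>
      <script>var grandson;</script>
      <p>Paragraph</p>
    </div>
    <script>var child;</script>
  </body>
</html>
"""


def visible_text(doc, xpath):
    """Hypothetical helper: run an XPath query, drop whitespace-only nodes."""
    return [t.strip() for t in doc.xpath(xpath) if t.strip()]


doc = html.fromstring(SAMPLE)

# First attempt: every text node below body's children, scripts included.
first = visible_text(doc, "//body/*//text()")

# Second attempt: the predicate only filters body's direct children,
# so a script nested deeper still contributes its text.
second = visible_text(doc, "//body/*[not(self::script)]//text()")

# Answer: take text() as a direct child of any element that is not a script.
answer = visible_text(doc, "//*[not(self::script)]/text()")

print(first)
print(second)
print(answer)
```

In a Scrapy callback the same expression can be passed to response.xpath(...).getall(). Note that the query also returns whitespace-only text nodes (filtered out here by the helper) and would return the contents of style elements if the page had any; //body//text()[not(parent::script)] is one possible variant that stays inside body.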
