Reputation: 6458

XPath expression for selecting all text in a given node, and the text of its children

Basically I need to scrape some text that has nested tags.

Something like this:

<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>

And I want an expression that will produce this:

This is an example bolded text

I have been struggling with this for an hour or more with no result.

Any help is appreciated.

Upvotes: 20

Answers (5)

Rafsan Jane

Reputation: 1

Normally you would use

//div[@id='theNode']

to get all the text, but if the text comes back split, then try

//div[@id='theNode']/text()

I'm not sure, but if you provide me the link I will try it.
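
One thing worth checking before relying on the second expression: `/text()` selects only the text nodes that are direct children of the div, so text inside nested tags is left out. Below is a minimal sketch of that behaviour using Python's standard library. It assumes the markup from the question parses as XML, and it imitates direct-child text with ElementTree's `.text` and `.tail` attributes rather than running the XPath itself. The variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question (it happens to be well-formed XML).
doc = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

node = ET.fromstring(doc)

# ElementTree keeps an element's direct text in .text (before the first child)
# and in each child's .tail (the text that follows that child).
direct = [node.text or ""] + [child.tail or "" for child in node]
joined = "".join(direct)

# The words inside <span> and <b> are absent from the direct text nodes.
print(repr(joined))
print("example" in joined, "bolded" in joined)
```

So the direct text nodes alone are not enough here; the text of the descendants has to be collected as well.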

Upvotes: -1

jerrymouse

Reputation: 17812

If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:

import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

selector = scrapy.Selector(text=txt, type="html") # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall() # List of text nodes in document order
final_txt = ''.join(all_txt).strip() # Concatenate, then trim the surrounding newlines
print(final_txt) # 'This is an example bolded text'

Upvotes: 4

Lachlan Roche

Reputation: 25956

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

You want to call the XPath string() function on the div element.

string(//div[@id='theNode'])
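
For readers without an XPath engine at hand, the string-value rule can be sketched with Python's standard library. This is only an illustration, not the XPath string() function itself: `xml.etree.ElementTree` does not support XPath functions, so the sketch uses `itertext()`, which yields an element's text descendants in document order. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question (it happens to be well-formed XML).
doc = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

root = ET.fromstring(doc)

# ElementTree supports only a small XPath subset, and ".//" searches below
# the root, so check the root element itself first.
node = root if root.get("id") == "theNode" else root.find(".//div[@id='theNode']")

# Concatenate every descendant text node in document order, which is how
# the string-value of an element node is defined.
string_value = "".join(node.itertext())
print(repr(string_value))
```

The result still carries the newlines from the source markup, which is where normalize-space comes in.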

You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a nodeset to normalize-space(), the nodeset will first be converted to its string-value. If no arguments are passed to normalize-space, it will use the context node.

normalize-space(//div[@id='theNode'])

// if theNode was the context node, you could use this instead
normalize-space()
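
To make the normalize-space behaviour concrete, here is a rough Python approximation of what it does to a string. It is a sketch, not the XPath implementation: the helper name `normalize_space` is invented here, and it assumes XPath 1.0's definition of whitespace (space, tab, carriage return, line feed), which is narrower than what Python's `str.split()` treats as whitespace.

```python
import re

def normalize_space(value: str) -> str:
    """Approximate XPath 1.0 normalize-space() for a Python string."""
    # XPath treats only space, tab, CR and LF as whitespace.
    xml_ws = " \t\r\n"
    # Strip leading and trailing whitespace, then collapse inner runs
    # of whitespace into a single space.
    return re.sub(f"[{xml_ws}]+", " ", value.strip(xml_ws))

raw = "\nThis is an example   bolded\n\ttext\n"
print(normalize_space(raw))  # This is an example bolded text
```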

You might want to use a more efficient way of selecting the context node than the example XPath I have been using. E.g., the following JavaScript example can be run against this page in some browsers.

var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;

The whitespace-only text node between the span and b elements might be a problem.
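
To illustrate that concern, the sketch below lists the individual text nodes of the sample markup and then shows what happens if whitespace-only nodes are thrown away before concatenation. It uses Python's standard `xml.etree.ElementTree`, which keeps such nodes. The filtering step is a hypothetical stand-in for a parser or tool that strips them, not something any particular library is claimed to do.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question (it happens to be well-formed XML).
doc = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

node = ET.fromstring(doc)

# Text nodes in document order; the third one is the single space that
# sits between </span> and <b>.
pieces = list(node.itertext())
print(pieces)

# Hypothetical: drop whitespace-only nodes, as some tools might.
kept = [p for p in pieces if p.strip()]
merged = "".join(kept).strip()
print(merged)  # This is an examplebolded text
```

With the separating node gone, "example" and "bolded" run together, and normalize-space cannot restore the space afterwards.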

Upvotes: 31

Sara

Reputation: 2515

How about this:

/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]

Hmm, I am not sure about the last part, though. You might have to play with that.

Upvotes: -1

Dimitre Novatchev

Reputation: 243539

Use:

string(//div[@id='theNode'])

When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document whose id attribute is 'theNode'.

As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.

Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:

Use:

normalize-space(string(//div[@id='theNode']))

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  "<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
  "<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<div id='theNode'> This is an 
    <span style="color:red">example</span>
    <b>bolded</b> text 
</div>

the two XPath expressions are evaluated and the results of these evaluations are copied to the output:

  " This is an 
    example
    bolded text 
"
===========
  "This is an example bolded text"

Upvotes: 2
