sa125

Reputation: 28971

Extracting nested text from HTML using XPath

I'm trying to extract textual content from an HTML page that looks something like this:

<div class="content">
    <div class="section">
      Lorem <a href="..." class="link">ipsum</a> 
      dolor <a href="..." class="link">sit</a> amet, 
      consectetur <a href="..." class="link">adipiscing</a> elit
    </div>

    <div class="section">
      sed do <a href="..." class="link">eiusmod</a> tempor 
      incididunt <a href="..." class="link">ut</a> labore 
      et <a href="..." class="link">dolore</a>
    </div>
</div>

I just want to extract the text portion:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore

My XPath (2.0) expression is //*[contains(@class, 'section')]. When I evaluate it using javax.xml.xpath.XPathExpression, I only retrieve the text that's outside the links:

Lorem dolor amet, consectetur elit, sed do tempor incididunt labore et

I haven't used XPath before. Is there a better expression to extract the full text? Thanks.

Upvotes: 0

Views: 1918

Answers (1)

dirkk

Reputation: 6218

Your expression returns a complete XML element. Your processor then converts that element to a string, which is basically the same as if you had executed

//*[contains(@class, 'section')]/text()

In contrast, you can also include the text inside child elements by using the string() function:

//*[contains(@class, 'section')]/string()

Another way, as pointed out by Mathias Müller in the comments, would be to use

//*[contains(@class, 'section')]//text()

which returns all text nodes that are descendants of the matched elements.
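
For reference, here is a minimal sketch of how this could look from Java with javax.xml.xpath. It rests on a few assumptions: the JDK's default XPathFactory implements XPath 1.0, where string() cannot be used as a path step, so the sketch selects the section elements first and then evaluates normalize-space(.) against each one to get its string value with whitespace collapsed. The class name SectionText, the method extractText and the sample markup are made up for illustration, and the input has to be well-formed XML (real-world HTML may need a more tolerant parser).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question or answer.
class SectionText {

    // Returns the text of every "section" element, including text nested
    // inside child elements such as <a>, joined by single spaces.
    static String extractText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: select the section elements as a node set.
            NodeList sections = (NodeList) xpath.evaluate(
                    "//*[contains(@class, 'section')]", doc, XPathConstants.NODESET);

            StringBuilder out = new StringBuilder();
            for (int i = 0; i < sections.getLength(); i++) {
                // Step 2: the string value of an element covers all of its
                // descendant text nodes; normalize-space trims and collapses
                // runs of whitespace.
                String text = xpath.evaluate("normalize-space(.)", sections.item(i));
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return out.toString();
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped to keep the sketch short.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input, shortened from the markup in the question.
        String sample =
                "<div class=\"content\">"
              + "<div class=\"section\">\n  Lorem <a href=\"#\" class=\"link\">ipsum</a>\n"
              + "  dolor <a href=\"#\" class=\"link\">sit</a> amet\n</div>"
              + "<div class=\"section\">sed do <a href=\"#\" class=\"link\">eiusmod</a> tempor</div>"
              + "</div>";
        System.out.println(extractText(sample));
    }
}
```

With an XPath 2.0 processor, the /string() expression above should give the per-section text directly, without the loop.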

Upvotes: 5
