XPath: Extract text between tags, but stop as soon as an embedded tag occurs

Question

I would like to extract the text within the following HTML. However, everything that occurs within an enclosed HTML tag and everything that comes after it should be ignored.

The HTML appears in different forms.

Text 1 Text 2 Text 3 Text 4 Text 5

Desired result: "Text 1 Text 2 Text 3"

Other variants:

Text 1 Text 2
Text 1 Text 2 Text 3
Text 1

Desired result: "Text 1"

Text 1 Text 2 Text 3

Desired result: "Text 1 Text 2 Text 3"

So everything after the occurrence of a span element with class "classC" should be ignored. It's also possible that "classC" doesn't appear at all.

I already tried //span[@class="classA"]//text()[parent::*[not(@class="classC")]]. This ignores the "classC" content but still returns the text after it (Text 5 from the first example).

How can I achieve this?

Update:

With //span[@class="classC"]//parent::*/preceding::text() I'm getting a little closer. However, it still doesn't work with Text 1, which returns nothing.

Siebe Jongebloed · Accepted Answer

Try this XPath:

//text()[not(preceding::span[@class="classC"]|ancestor::span[@class="classC"])]

As Michael Kay said, though, it could be very inefficient, depending on your source HTML.
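
If the efficiency concern matters for your documents, the same selection rule (keep only text that is neither inside nor after a span with class "classC") can also be applied in a single pass outside XPath. Below is a minimal sketch using Python's standard-library html.parser. The class name TextBeforeMarker, the helper text_before_class_c, and the sample markup in the usage note are assumptions made for illustration, since the question's original markup is not shown here. Like the XPath above, it compares the class attribute exactly and looks at the whole document rather than only at "classA".

```python
from html.parser import HTMLParser


class TextBeforeMarker(HTMLParser):
    """Collect text nodes until the first <span class="classC"> opens."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.stopped = False

    def handle_starttag(self, tag, attrs):
        # Exact match on the class attribute, mirroring @class="classC".
        if tag == "span" and dict(attrs).get("class") == "classC":
            # Everything from here on is inside or after the marker span.
            self.stopped = True

    def handle_data(self, data):
        if not self.stopped:
            self.parts.append(data)


def text_before_class_c(html):
    """Return the text that precedes the first classC span (hypothetical helper)."""
    parser = TextBeforeMarker()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result is easy to compare.
    return " ".join("".join(parser.parts).split())
```

With assumed markup such as <span class="classA">Text 1 Text 2 Text 3 <span class="classC">Text 4</span> Text 5</span>, calling text_before_class_c on it yields "Text 1 Text 2 Text 3". If no "classC" span occurs, all text is returned.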
