Reputation: 733
I am looking for a way to get text that is not wrapped in an HTML element of its own:
<div class="col-sm-4">
<strong>Handelnde Personen:</strong><br><br>
<strong>Geschäftsführer</strong><br>
Mr John Doe<br>
Privatperson<br>
.....<br>
<br>
I want to get "Mr John Doe".
The only way I can see is to find the strong element that contains "Geschäftsführer" and then take the text that follows it.
My idea so far:
//strong[contains(text(), 'Gesch')]/br/../text()
... I simply can't make it work.
Also, is there a "wildcard" for strings, so that I could use
*esch*ftsf*hr*
for "Geschäftsführer"?
I highly appreciate your help, thanks!
Upvotes: 0
Views: 267
Reputation: 163585
Try
//strong[starts-with(., 'Gesch')]/following-sibling::text()[1]
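If no XPath engine is at hand, the same idea (find the strong element whose text starts with "Gesch", then take the first text that follows it) can be roughly emulated with Python's standard html.parser. This is only a sketch under assumptions: the class and function names are made up for illustration, the markup is the snippet from the question, and unlike following-sibling::text()[1] it skips whitespace-only text and does not check that the text is really a sibling of the strong element.

```python
from html.parser import HTMLParser

# The snippet from the question, unclosed <br> and <div> tags included.
HTML = """<div class="col-sm-4">
<strong>Handelnde Personen:</strong><br><br>
<strong>Geschäftsführer</strong><br>
Mr John Doe<br>
Privatperson<br>
"""


class TextAfterLabel(HTMLParser):
    """Collects the first non-blank text that follows a matching <strong>."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix      # label prefix to look for, e.g. "Gesch"
        self.in_strong = False    # currently inside a <strong> element
        self.label = ""           # text collected inside the current <strong>
        self.waiting = False      # a matching <strong> has just closed
        self.result = None        # first text found after the match

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self.label = ""

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
            # Only the first matching label is of interest.
            if self.result is None and self.label.startswith(self.prefix):
                self.waiting = True

    def handle_data(self, data):
        if self.in_strong:
            self.label += data
        elif self.waiting and data.strip():
            self.result = data.strip()
            self.waiting = False


def text_after_label(html, prefix):
    """Hypothetical helper: returns the text after the label, or None."""
    parser = TextAfterLabel(prefix)
    parser.feed(html)
    parser.close()
    return parser.result


print(text_after_label(HTML, "Gesch"))
```

With a real XPath 1.0 engine the one-line expression above is the better choice; the parser version is mainly useful for seeing which text node the expression picks.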
As for wildcard matching, with XPath 2.0 you can use regular expressions via the matches() function:
//strong[matches(., '.*esch.*ftsf.*hr.*')]
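To see what this pattern accepts without an XPath 2.0 processor, here is a small sketch using Python's re module. It assumes that re.search is a fair stand-in for matches(), since both succeed when the pattern matches anywhere in the string; the function name and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the XPath expression: ".*" plays the role of the
# "*" wildcard from the question.
WILDCARD = re.compile(r".*esch.*ftsf.*hr.*")


def matches_label(text):
    """True if the wildcard pattern is found anywhere in text."""
    return WILDCARD.search(text) is not None


for candidate in ("Geschäftsführer", "Geschaeftsfuehrer",
                  "GESCHÄFTSFÜHRER", "Privatperson"):
    print(candidate, matches_label(candidate))
```

Note that the pattern is case-sensitive, so the all-uppercase spelling is not matched; in XPath 2.0 the optional third argument of matches() accepts the "i" flag for case-insensitive matching.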
With XPath 3.0 you could also compare using the Unicode Collation Algorithm:
//strong[compare(., 'Geschäftsführer',
'http://www.w3.org/2013/collation/UCA?strength=primary') = 0]
(strength=primary ignores case and accents)
But to get anything more advanced than XPath 1.0 in the browser, you would need to deploy Saxon-JS.
Another option with XPath 1.0 is to use translate() to fold case and replace umlauts with their base letters:
//strong[translate(., 'ABCD..XYZÄÖÜäöüß', 'abcd..xyzaouaous') = 'geschaftsfuhrer']
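The character-for-character mapping that translate() performs can be tried out with Python's str.translate, which works the same way. This is only an illustration: the table below spells out the full A-Z alphabet (the expression above abbreviates it), and the names are made up for the example.

```python
import string

# Source and target strings mirror the second and third arguments of
# translate(): the n-th source character is replaced by the n-th target one.
SOURCE = string.ascii_uppercase + "ÄÖÜäöüß"
TARGET = string.ascii_lowercase + "aouaous"
TABLE = str.maketrans(SOURCE, TARGET)


def normalize(label):
    """Lower-cases ASCII letters and maps German umlauts to base letters."""
    return label.translate(TABLE)


print(normalize("Geschäftsführer"))
print(normalize("GESCHÄFTSFÜHRER"))
```

Both calls produce the same normalized string, which is what the XPath comparison relies on. One caveat: the mapping only catches precomposed characters such as "ä"; an "a" followed by a combining diaeresis would pass through unchanged.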
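Note that in all these examples I have used "." rather than "text()" to get the string value of an element; this is recommended practice, because "." yields the concatenated text of the whole element, whereas "text()" selects individual text nodes and, in XPath 1.0, a function such as contains() then looks only at the first of them.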
Upvotes: 1