Reputation: 73
I'm using an HTML parser library to parse a web page into XML. From that XML I want to use XPath queries to select nodes whose text belongs together.
Here's an example of the HTML:
<p><span style="font-family: 'Verdana','sans-serif'; font-size: 32pt;"><span style="font-family: 'Verdana','sans-serif'; font-size: 11pt; mso-bidi-font-size: 18.0pt;"> <span style="line-height: 115%; font-family: 'Verdana','sans-serif'; font-size: 36pt; mso-fareast-font-family: Calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: EN-US; mso-ansi-language: SV; mso-bidi-language: AR-SA;"> </span> VECKA 3</span></span></p><p><span style="font-family: 'Verdana','sans-serif'; font-size: 32pt;"></span><span style="font-family: 'Verdana','sans-serif'; font-size: 11pt; mso-bidi-font-size: 18.0pt;"> 17-21 JANUARI</span></p>
<p style="margin-bottom: 0pt;"><span style="font-family: 'Verdana','sans-serif'; font-size: 11pt; mso-bidi-font-size: 18.0pt;"> </span><span style="font-family: 'Verdana','sans-serif'; font-size: 11pt; mso-bidi-font-size: 18.0pt;">11.30-14.30</span></p>
<p style="margin-bottom: 0pt;"><span style="font-family: 'Verdana','sans-serif'; font-size: 10pt; mso-bidi-font-size: 15.0pt;">MÅNDAG: Parmesangratinerad tungafile med paprikasås</span></p>
<p style="margin-bottom: 0pt;"><span style="font-family: 'Verdana','sans-serif'; font-size: 10pt; mso-bidi-font-size: 15.0pt;"> Biffgryta med syltlök & ris</span></p>
Using XPath on the parsed piece of HTML, I want to select the <span>-node containing the word MÅNDAG, and also the following <span>-node, which belongs to it. For example, I want to select the nodes that contain the text "MÅNDAG: Parmesangratinerad tungafile med paprikasås" and the text "Biffgryta med syltlök & ris".
I think I want an XPath that looks something like this:
"//span[contains(.,'MÅNDAG') or (contains(.,' ') and ../parent-sibling::/span[contains(.,'MÅNDAG')]]"
Any ideas?
Upvotes: 0
Views: 1267
Reputation:
"I want to select the <span>-node containing the word MÅNDAG, but also the following <span>-node which belongs to it"
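A single XPath 1.0 path expression, without a top-level union of two paths. It keeps every span for which either the span itself or the nearest preceding span contains MÅNDAG: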
//span[(.|preceding::span[1])[contains(.,'MÅNDAG')]]
Upvotes: 0
Reputation: 163635
In XPath 2.0:
//span[contains(.,'MÅNDAG')]/(. | following::span[1])
In XPath 1.0:
//span[contains(.,'MÅNDAG')] | //span[contains(.,'MÅNDAG')]/following::span[1]
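To make it concrete what the XPath 1.0 union is meant to select, here is a rough sketch in plain Python. It does not evaluate the XPath itself: the standard library's xml.etree.ElementTree only supports a small XPath subset (no contains(), no following:: axis), so the selection is emulated by walking the spans in document order. The markup is a simplified stand-in for the question's HTML with the style attributes dropped, the TISDAG row is made up purely to show that only the first following span is taken, and the helper names text_of and select_with_following are my own.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the question's markup.
# The TISDAG row is hypothetical; it is not in the original page sample.
SAMPLE = """<div>
<p><span>11.30-14.30</span></p>
<p><span>MÅNDAG: Parmesangratinerad tungafile med paprikasås</span></p>
<p><span> Biffgryta med syltlök &amp; ris</span></p>
<p><span>TISDAG: (hypothetical extra row)</span></p>
</div>"""


def text_of(elem):
    """Concatenated text of an element, like '.' inside contains()."""
    return "".join(elem.itertext())


def select_with_following(root, word, tag="span"):
    """Emulate //tag[contains(.,word)] | //tag[contains(.,word)]/following::tag[1]."""
    spans = list(root.iter(tag))  # document order
    keep = set()                  # indices, so the result has no duplicates
    for i, span in enumerate(spans):
        if word not in text_of(span):
            continue
        keep.add(i)
        # The following:: axis skips descendants of the context node.
        inside = set(span.iter())
        for j in range(i + 1, len(spans)):
            if spans[j] not in inside:
                keep.add(j)       # first following span only
                break
    return [spans[i] for i in sorted(keep)]


root = ET.fromstring(SAMPLE)
selected = [text_of(s).strip() for s in select_with_following(root, "MÅNDAG")]
print(selected)
```

With a full XPath 1.0 engine (for example lxml's xpath() method, if that library is available to you) the expression can be passed in directly, and this emulation is not needed.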
Upvotes: 0