Carl

Reputation: 39

XPath to find nodes with text + all their descendants & siblings that match certain criteria

Background:
I'm trying to improve on a Greasemonkey script I found.
The script marks prices in foreign currencies and can translate them into the currency of your choice.

The main problem:
How can I make the script handle prices whose parts are wrapped in tags, such as:

<b><i>9.</i></b><sup>95</sup>EUR

(Newegg.com does this, for example - they write their prices like so: <span>$</span>174<sup>.99</sup>).

Currently, the script only finds prices that are contained entirely within a single text node, since the XPath expression being used is:

document.evaluate("//text()", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null)

Since the script needs to be fast, I'm trying to avoid stepping through the DOM too much...
Are there any XPath gurus who could help out with some smart solutions for this purpose?


More detailed description of the problem:
The code I now have for finding the text nodes:

var re_skip = /^(SCRIPT|IFRAME|TEXTAREA|STYLE|OPTION|TITLE|HEAD|NOSCRIPT)$/;  // Elements whose text-node children can be skipped
var text = document.evaluate("//text()", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
var i = text.snapshotLength;
var el;
while (i--) {
    el = text.snapshotItem(i);
    // Skip detached nodes, excluded parents, and parents carrying the 'autocurrency' class
    if (!el.parentNode || re_skip.test(el.parentNode.nodeName.toUpperCase()) || el.parentNode.className == 'autocurrency') {
        continue;
    }
    // ...
    // (RegEx logic to check if prices can be found in the text)
}


I can re-write the regex to handle text such as "<span>$</span>174<sup>.99</sup>" as long as I find these text strings - preferably using XPath, as I have understood this to be much faster than stepping through the DOM.
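
To illustrate the kind of matching meant here, a minimal sketch: strip the tags from a markup string, then run a price regex over what is left. The names stripTags and findPrices and the pattern itself are hypothetical, not the script's actual code, and the tag stripping is deliberately naive.

```javascript
// Hypothetical helper: drop tags so the regex sees plain text.
// Naive on purpose: it ignores '>' inside attribute values and HTML entities.
function stripTags(markup) {
    return markup.replace(/<[^>]*>/g, '');
}

// Hypothetical matcher: returns every number that has a currency marker
// directly before or after it. The symbol and code lists are illustrative only.
function findPrices(markup) {
    var text = stripTags(markup);
    var re = /([$€£])?\s*(\d+(?:[.,]\d+)?)\s*(EUR|GBP|USD|kr|AU\$)?/g;
    var found = [];
    var m;
    while ((m = re.exec(text)) !== null) {
        if (m[1] || m[3]) {  // a bare number without a marker is not a price
            found.push({ symbol: m[1] || null, amount: m[2], code: m[3] || null });
        }
    }
    return found;
}
```

With this, findPrices('<span>$</span>174<sup>.99</sup>') sees the text "$174.99", and findPrices('<b><i>9.</i></b><sup>95</sup>EUR') sees "9.95EUR". The open problem remains how to obtain those markup strings in the first place.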

Thank you very much in advance for any help you can give me with this!

--------------------------------------------------------------
EDIT:
OK, I realize now that the question could do with some clarification and some examples, so here they come. A web page might look something like this:

<body>
  <div>
    <span>9.95 <span>EUR</span></span><br />
    <span>8.<sup>95</sup></span>AU$<br />
    <table>
      <thead>
        <tr>
          <th>Bla</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td><b>7</b>.95kr</td>
        </tr>
      </tbody>
    </table>
    <div>Bla bla</div>
    6.95 <span>GBP</span>
  </div>
  <div><img src="" /><img src=""><span>Bla bla bla</span></div>
</body>

Now, in that example, the overhead isn't that great: I could just feed the whole source code, as a string, directly to the regex that finds prices. But pages normally contain lots of non-text elements, which would make the script very slow if I didn't use a fast XPath to pull out the text. So I'm looking for an XPath expression that finds the different texts in the example above, and not just the text content, since we also need any tags that surround parts of a price (a new <span> will later be created around the matched price, including any inline elements that surround parts of it).

I don't know exactly what the XPath could be made to return, but somehow I need to get hold of the following strings from the example page above:

"9.95 <span>EUR</span>"       (or possibly: "<span>9.95 <span>EUR</span></span>")
"<span>8.<sup>95</sup></span>AU$"
"Bla"                         (or possibly: "<th>Bla</th>")
"<b>7</b>.95kr"               (or possibly: "<td><b>7</b>.95kr</td>")
"Bla bla"                     (or possibly: "<div>Bla bla</div>")
"6.95 <span>GBP</span>"
"Bla bla bla"                 (or possibly: "<span>Bla bla bla</span>")

and then these strings can be parsed by the regex that finds prices.
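
For comparison, here is a rough sketch of how strings like these could be gathered by walking child nodes rather than by XPath, so it shows the desired output and not the hoped-for XPath solution. It groups adjacent text and inline nodes into one string and starts a new string at every block-level element or <br>. The function name collectRuns and the list of block-level names are assumptions made for illustration, and the list is not exhaustive.

```javascript
// Same skip list as in the loop above.
var re_skip = /^(SCRIPT|IFRAME|TEXTAREA|STYLE|OPTION|TITLE|HEAD|NOSCRIPT)$/;
// Assumption: an illustrative, incomplete set of block-like element names.
var re_block = /^(BODY|DIV|P|TABLE|THEAD|TBODY|TFOOT|TR|TH|TD|UL|OL|LI|H[1-6])$/;

// Hypothetical helper: pushes one markup string per run of adjacent
// text/inline nodes onto `out`. It only reads nodeType, nodeName,
// childNodes, nodeValue and outerHTML, all of which DOM nodes provide.
// Text is copied unescaped, so '&' in a text node is not turned into '&amp;'.
function collectRuns(element, out) {
    var run = '';
    function flush() {
        var s = run.replace(/^\s+|\s+$/g, '');
        if (s) { out.push(s); }
        run = '';
    }
    for (var i = 0; i < element.childNodes.length; i++) {
        var node = element.childNodes[i];
        if (node.nodeType === 3) {             // text node
            run += node.nodeValue;
        } else if (node.nodeType === 1) {      // element node
            var name = node.nodeName.toUpperCase();
            if (re_skip.test(name)) {
                continue;                      // ignore script, style, ...
            } else if (re_block.test(name)) {
                flush();                       // a block ends the current run
                collectRuns(node, out);        // and is searched on its own
            } else if (name === 'BR') {
                flush();                       // a line break ends the run too
            } else {
                run += node.outerHTML;         // inline element: keep its tags
            }
        }
    }
    flush();
    return out;
}
```

Called as collectRuns(document.body, []), this would return whole inline elements (for instance "<span>9.95 <span>EUR</span></span>" rather than "9.95 <span>EUR</span>"), which corresponds to the "or possibly" variants in the list. Whether it is fast enough on large pages is untested.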

Upvotes: 2

Views: 2185

Answers (1)

Martin Honnen

Reputation: 167696

Well, you can certainly use a path like //*[not(self::script | self::textarea | self::style)]/text() to find only those text nodes whose parent element is not one of "script", "textarea", "style". Note the single / before text(): with //text() the path would still return text inside a script element, because that text is also a descendant of non-excluded ancestors such as html or body. So the regular expression test you have is not necessary; you can express that requirement in XPath. Whether that performs better I can't tell; you will have to check with the XPath implementations of the browser(s) you want to use the Greasemonkey script with.
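
A small sketch of moving the element filter into the XPath expression itself. It filters on the text node's parent, mirroring the el.parentNode test in the question's loop. The function names buildTextXPath and findTextNodes are made up for this example, and whether it is faster than the regex test is not measured here.

```javascript
// Hypothetical helper: builds an XPath 1.0 expression that selects every
// text node whose parent element is not one of the given names.
function buildTextXPath(skipNames) {
    if (skipNames.length === 0) {
        return '//text()';
    }
    var tests = skipNames.map(function (name) {
        return 'parent::' + name.toLowerCase();
    });
    return '//text()[not(' + tests.join(' or ') + ')]';
}

// Hypothetical wrapper around document.evaluate; XPathResult is the
// browser global, so this function is meant to run inside a page.
function findTextNodes(doc, skipNames) {
    return doc.evaluate(buildTextXPath(skipNames), doc, null,
        XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
}
```

The result can be iterated with snapshotLength and snapshotItem(i) exactly as before, and the re_skip test in the loop then becomes redundant for the names passed in.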

Upvotes: 1
