Reputation: 39
Background:
I'm trying to improve on a Greasemonkey script I found.
The script marks prices in foreign currencies and can convert them into the currency of your choice.
The main problem:
How can the script handle prices whose parts are wrapped in tags, such as:
<b><i>9.</i></b><sup>95</sup>EUR
(Newegg.com does this, for example - they write their prices like so: <span>$</span>174<sup>.99</sup>).
Currently, the script only finds prices that lie entirely within a single text node, because the XPath expression being used is:
document.evaluate("//text()", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null)
Since the script needs to be fast, I'm trying to avoid stepping through the DOM too much...
Are there any XPath gurus who could help out with some smart solutions for this purpose?
More detailed description of the problem:
The code I now have for finding the text nodes:
var re_skip = /^(SCRIPT|IFRAME|TEXTAREA|STYLE|OPTION|TITLE|HEAD|NOSCRIPT)$/; // Elements whose text-node children can be skipped
var text = document.evaluate("//text()", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
var i = text.snapshotLength;
var el;
while (i--) {
    el = text.snapshotItem(i);
    // Skip detached nodes, ignored elements, and prices the script has already wrapped
    if (!el.parentNode || re_skip.test(el.parentNode.nodeName.toUpperCase()) || el.parentNode.className == 'autocurrency') {
        continue;
    }
    // ...
    // (RegEx logic to check if prices can be found in the text)
}
The check that discards text nodes whose parent elements are listed in "re_skip" could also be done in the XPath expression itself (using not()), right? And would that give a speed increase?
If an ordered XPath result type is used instead, I guess I will no longer need the check for whether the parent of the text node being parsed is <span class="autocurrency"> (that is, the <span> that the script adds around matched prices).
If I've understood things correctly, normalize-space() (as suggested here) cannot be used in this case, since the script adds a <span class="autocurrency"> around the matched amount and we need to retain the correct index at which this <span> should be inserted.
Is there a way for the XPath to allow only certain (inline) elements to be used in-between the currency values? Or perhaps it could do this: "when a node containing text is found, also include all of its children (and their children and so on) in the match - unless the child node is a block type element." (or perhaps it should read: "...unless the child node is a DIV, P, TABLE, or any of the elements in re_skip")
I can re-write the regex to handle text such as "<span>$</span>174<sup>.99</sup>" as long as I find these text strings - preferably using XPath, as I have understood this to be much faster than stepping through the DOM.
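One way to match across tags while keeping the insertion index is to concatenate the values of adjacent text nodes and remember where each node starts in the joined string. The following is only a minimal sketch, assuming the text nodes of one price candidate have already been collected in document order. `joinTextNodes` and `locateOffset` are hypothetical helper names, the regex is purely illustrative, and plain objects stand in for DOM text nodes:

```javascript
// Hypothetical helper: join the values of text nodes (in document order)
// and record the offset at which each node starts in the joined string.
function joinTextNodes(nodes) {
  var starts = [];
  var joined = '';
  for (var i = 0; i < nodes.length; i++) {
    starts.push(joined.length);
    joined += nodes[i].nodeValue;
  }
  return { text: joined, starts: starts };
}

// Hypothetical helper: map an offset in the joined string back to
// { index: position of the node in the list, offset: offset inside that node }.
function locateOffset(starts, offset) {
  var i = starts.length - 1;
  while (i > 0 && starts[i] > offset) {
    i--;
  }
  return { index: i, offset: offset - starts[i] };
}

// Plain objects stand in for the three text nodes of
// <span>$</span>174<sup>.99</sup>; real text nodes also expose nodeValue.
var nodes = [{ nodeValue: '$' }, { nodeValue: '174' }, { nodeValue: '.99' }];
var joined = joinTextNodes(nodes);
var match = /\$\d+\.\d{2}/.exec(joined.text); // illustrative regex only
// Node and offset of the first and of the last matched character.
var start = locateOffset(joined.starts, match.index);
var end = locateOffset(joined.starts, match.index + match[0].length - 1);
```

Here `start` and `end` identify the text nodes (and the offsets inside them) where the wrapping <span> would have to begin and end. Whether this bookkeeping is fast enough on large pages would need to be measured.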
Thank you very much in advance for any help you can give me with this!
--------------------------------------------------------------
EDIT:
OK, I realize now that the question could do with some clarification and some examples, so here they come. A web page might look something like this:
<body>
    <div>
        <span>9.95 <span>EUR</span></span><br />
        <span>8.<sup>95</sup></span>AU$<br />
        <table>
            <thead>
                <tr>
                    <th>Bla</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td><b>7</b>.95kr</td>
                </tr>
            </tbody>
        </table>
        <div>Bla bla</div>
        6.95 <span>GBP</span>
    </div>
    <div><img src="" /><img src=""><span>Bla bla bla</span></div>
</body>
Now, in that example, the overhead isn't that great - I could just feed the whole source code, as a string, directly to the regex that finds prices. But normally, pages will have lots of non-text elements that would make the script very slow if I didn't use a fast XPath to parse out the texts. So, I'm looking for an XPath expression that would find the different texts in the example above, but not just the text content - since we also need tags that might surround parts of a price (a new <span> will later be created around the matched price, including any inline elements that might surround parts of the price).
I don't know exactly what the XPath could be made to return, but somehow I need to get hold of the following strings from the example page above:
"9.95 <span>EUR</span>" (or possibly: "<span>9.95 <span>EUR</span></span>")
"<span>8.<sup>95</sup></span>AU$"
"Bla" (or possibly: "<th>Bla</th>")
"<b>7</b>.95kr" (or possibly: "<td><b>7</b>.95kr</td>")
"Bla bla" (or possibly: "<div>Bla bla</div>")
"6.95 <span>GBP</span>"
"Bla bla bla" (or possibly: "<span>Bla bla bla</span>")
and then these strings can be parsed by the regex that finds prices.
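The grouping listed above can be approximated by walking up from each text node to its nearest ancestor that is not an inline element, and keeping consecutive text nodes that share that ancestor together. This is a rough sketch, not a tested solution: the `INLINE` list is an illustrative assumption, `blockAncestor` and `groupByBlock` are hypothetical helpers, and plain objects stand in for DOM nodes. It also assumes the input contains <br> elements alongside the text nodes in document order (for example from an ordered snapshot of `//text() | //br`), so that a <br> can end a run:

```javascript
// Illustrative, incomplete list of inline elements allowed inside a price.
var INLINE = /^(SPAN|B|I|SUP|SUB|EM|STRONG|SMALL|FONT)$/;

// Hypothetical helper: climb from a text node to the nearest ancestor
// whose nodeName is not in the inline list.
function blockAncestor(node) {
  var el = node.parentNode;
  while (el && el.parentNode && INLINE.test(String(el.nodeName).toUpperCase())) {
    el = el.parentNode;
  }
  return el;
}

// Hypothetical helper: split nodes (in document order) into runs of
// consecutive text nodes sharing one block ancestor. Any non-text node
// in the list (e.g. a <br>) ends the current run.
function groupByBlock(nodes) {
  var groups = [];
  var current = null;
  var currentBlock = null;
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    if (node.nodeType !== 3) {
      current = null;
      continue;
    }
    var block = blockAncestor(node);
    if (current === null || block !== currentBlock) {
      current = [];
      currentBlock = block;
      groups.push(current);
    }
    current.push(node);
  }
  return groups;
}

// Mock of: <div><span>8.<sup>95</sup></span>AU$<br />6.95 ...</div>
var div = { nodeName: 'DIV', parentNode: { nodeName: 'BODY', parentNode: null } };
var span = { nodeName: 'SPAN', parentNode: div };
var sup = { nodeName: 'SUP', parentNode: span };
var mockNodes = [
  { nodeType: 3, nodeValue: '8.', parentNode: span },
  { nodeType: 3, nodeValue: '95', parentNode: sup },
  { nodeType: 3, nodeValue: 'AU$', parentNode: div },
  { nodeType: 1, nodeName: 'BR', parentNode: div },
  { nodeType: 3, nodeValue: '6.95 ', parentNode: div }
];
var groups = groupByBlock(mockNodes);
```

Each resulting group holds the text nodes of one candidate string ("8.", "95", "AU$" in the first group here), which can then be joined and fed to the price regex.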
Upvotes: 2
Views: 2185
Reputation: 167696
Well, you can certainly use a path like //*[not(self::script | self::textarea | self::style)]/text()
to find only those text nodes whose parent element is not one of "script", "textarea", or "style". (Note the single / before text(): with //text(), text inside a script would still match, because it is also a descendant of body or html.) So the regular expression test you have is not necessary; you could express that requirement with XPath. Whether that performs better I can't tell; you will have to check with the XPath implementations of the browser(s) you want to use the Greasemonkey script with.
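The skip list from the question could also be turned into an XPath predicate programmatically. The sketch below is one possible variant that tests the parent of each text node directly. `buildTextXPath` is a hypothetical helper, and the unprefixed lowercase element names assume an HTML (not XML/XHTML) document:

```javascript
// Hypothetical helper: build an XPath that selects text nodes whose parent
// element is none of the given tag names.
function buildTextXPath(skipTags) {
  if (skipTags.length === 0) {
    return '//text()';
  }
  var tests = skipTags.map(function (tag) {
    return 'parent::' + tag.toLowerCase();
  });
  return '//text()[not(' + tests.join(' or ') + ')]';
}

var skip = ['script', 'iframe', 'textarea', 'style', 'option', 'title', 'head', 'noscript'];
var xpath = buildTextXPath(skip);

// Only runs where a DOM is available (e.g. inside the userscript).
if (typeof document !== 'undefined' && typeof XPathResult !== 'undefined') {
  var texts = document.evaluate(xpath, document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (var i = 0; i < texts.snapshotLength; i++) {
    var node = texts.snapshotItem(i);
    // ... run the price regex on node.nodeValue here
  }
}
```

Timing this against the original loop with the regular expression test, on the target browsers, is the only reliable way to tell which is faster.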
Upvotes: 1