Reputation: 31486
With a 5 MB document, libxml2 takes about 3 seconds to evaluate the following query. Is there anything I could do to speed things up? I need the resulting node-set for further processing, so no count, etc.
Thanks!
descendant::text() | descendant::*
[
self::p or
self::h1 or
self::h2 or
self::h3 or
self::h4 or
self::h5 or
self::h6 or
self::dl or
self::dt or
self::dd or
self::ol or
self::ul or
self::li or
self::dir or
self::address or
self::blockquote or
self::center or
self::del or
self::div or
self::hr or
self::ins or
self::pre
]
Edit:
Using descendant::node()[self::text() or self::p or ... as suggested by Jens Erat (see the accepted answer) significantly improved the speed: from the original 2.865330 s down to 0.164336 s.
Upvotes: 0
Views: 1136
Reputation: 38662
Benchmarking without any document to benchmark on is very difficult.
Two ideas for optimizing:
Use as few descendant:: axis steps as possible. They're expensive, so cutting them down will probably speed things up a little. You can combine the text() and element tests like this:
descendant::node()[self::text() or self::h1 or self::h2]
and extend it to all elements (I'm keeping the query short for readability).
Use string tests instead of node tests. They could be faster (but probably aren't; see the comments on this answer). You need to keep the text() test, of course.
descendant::node()[self::text() or local-name(.) = 'h1' or local-name(.) = 'h2']
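Writing out either variant for all 22 elements by hand is error-prone, so here is a minimal sketch that generates both query strings from the tag list in the question. The names BLOCK_TAGS, self_test_query and local_name_query are hypothetical helpers made up for illustration; they only build the XPath text and do not evaluate it.

```python
# Tag names copied from the predicate in the question's query.
BLOCK_TAGS = [
    "p", "h1", "h2", "h3", "h4", "h5", "h6",
    "dl", "dt", "dd", "ol", "ul", "li",
    "dir", "address", "blockquote", "center",
    "del", "div", "hr", "ins", "pre",
]


def self_test_query(tags):
    """Build one descendant:: step using self:: node tests (first idea)."""
    tests = ["self::text()"] + [f"self::{tag}" for tag in tags]
    return "descendant::node()[" + " or ".join(tests) + "]"


def local_name_query(tags):
    """Build the same step using local-name() string tests (second idea)."""
    tests = ["self::text()"] + [f"local-name(.) = '{tag}'" for tag in tags]
    return "descendant::node()[" + " or ".join(tests) + "]"


if __name__ == "__main__":
    # Print both variants so they can be pasted into an XPath evaluator.
    print(self_test_query(BLOCK_TAGS))
    print(local_name_query(BLOCK_TAGS))
```

Either string can then be handed to whatever XPath 1.0 evaluator you already use, which makes it easy to time the two variants against each other on your own document.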
If you often query the same document, consider using a native XML database such as BaseX, eXist-db, Zorba, MarkLogic, ... (the first three are free). They index your data and should be able to serve the results much faster (and they support XPath 2.0/XQuery, which makes development much easier). All of them have APIs for a wide range of programming languages.
Upvotes: 3
Reputation: 241738
Your query is equivalent to
(descendant::text() | descendant::p
| descendant::h1 | descendant::h2 | descendant::h3 | descendant::h4 | descendant::h5 | descendant::h6
| descendant::dl | descendant::dt | descendant::dd | descendant::ol | descendant::ul | descendant::li
| descendant::dir | descendant::address | descendant::blockquote | descendant::center
| descendant::del | descendant::div | descendant::hr | descendant::ins | descendant::pre
)
But I am not able to measure any difference in its speed.
Upvotes: 0
Reputation: 194
Do you have libxml2 compiled with the --with-threads option enabled? If so, the most straightforward thing to do would be to throw a faster processor with more cores at the problem.
Upvotes: 0