Rudolf Adamkovič

Reputation: 31486

Large document XPath query performance

With a 5 MB document, libxml2 takes about 3 seconds to evaluate the following query. Is there anything I can do to speed it up? I need the resulting node-set for further processing, so shortcuts such as count() are not an option.

Thanks!

descendant::text() | descendant::*
[
self::p or
self::h1 or
self::h2 or
self::h3 or
self::h4 or
self::h5 or
self::h6 or
self::dl or
self::dt or
self::dd or
self::ol or
self::ul or
self::li or
self::dir or
self::address or
self::blockquote or
self::center or
self::del or
self::div or
self::hr or
self::ins or
self::pre
]

Edit:

Using descendant::node()[self::text() or self::p or ... as suggested by Jens Erat (see the accepted answer) improved the speed significantly: from the original 2.865330 s down to 0.164336 s.

Upvotes: 0

Views: 1136

Answers (3)

Jens Erat

Reputation: 38662

Benchmarking without any document to benchmark on is very difficult.

Two ideas for optimizing:

  • Use as few descendant:: axis steps as possible. They are expensive, so cutting them down will probably speed things up a little. You can combine the text() and element tests like this:

    descendant::node()[self::text() or self::h1 or self::h2]
    

    and extend for all elements (I'm keeping the query short for better readability).

  • Use string tests instead of node tests. They could be faster, although they probably are not (see the comments on this answer). You need to keep the text() test, of course.

    descendant::node()[self::text() or local-name(.) = 'h1' or local-name(.) = 'h2']
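
Writing out the full predicate by hand is error-prone, so here is a minimal sketch that generates both variants above from a list of element names. The element list is copied from the query in the question; the function name `single_pass_query` and its `use_local_name` flag are made up for illustration and are not part of libxml2 or any other library. The code only builds the expression string, so you would still hand the result to whatever XPath API you are using.

```python
# Element names copied from the query in the question.
BLOCK_ELEMENTS = [
    "p", "h1", "h2", "h3", "h4", "h5", "h6",
    "dl", "dt", "dd", "ol", "ul", "li",
    "dir", "address", "blockquote", "center",
    "del", "div", "hr", "ins", "pre",
]


def single_pass_query(names, use_local_name=False):
    """Build a single descendant::node() step whose predicate accepts
    text nodes plus the given element names.

    Hypothetical helper for illustration only.
    """
    if use_local_name:
        # String-test variant: compare local-name(.) against each name.
        tests = [f"local-name(.) = '{name}'" for name in names]
    else:
        # Node-test variant: one self:: step per element name.
        tests = [f"self::{name}" for name in names]
    predicate = " or ".join(["self::text()"] + tests)
    return f"descendant::node()[{predicate}]"


if __name__ == "__main__":
    # Print the full node-test query for all elements from the question.
    print(single_pass_query(BLOCK_ELEMENTS))
```

Calling `single_pass_query(["h1", "h2"])` reproduces the short node-test example, and passing `use_local_name=True` reproduces the string-test one.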
    

If you often query the same document, consider a native XML database such as BaseX, eXist DB, Zorba, or MarkLogic (the first three are free). They index your data and should be able to serve results much faster. They also support XPath 2.0/XQuery, which makes development much easier. All of them have APIs for a large set of programming languages.

Upvotes: 3

choroba

Reputation: 241738

Your query is equivalent to

(descendant::text() | descendant::p
    | descendant::h1  | descendant::h2  | descendant::h3 | descendant::h4  | descendant::h5 | descendant::h6
    | descendant::dl  | descendant::dt  | descendant::dd | descendant::ol  | descendant::ul | descendant::li
    | descendant::dir | descendant::address | descendant::blockquote | descendant::center
    | descendant::del | descendant::div | descendant::hr | descendant::ins | descendant::pre
)

However, I was not able to measure any difference in speed between the two forms.

Upvotes: 0

CamHenlin

Reputation: 194

Do you have libxml2 compiled with the --with-threads option enabled? If so, the most straightforward thing to do would be to throw a faster processor with more cores at the problem.

Upvotes: 0
