Roel Van de Paar

Reputation: 2228

How to dramatically increase the speed of xsltproc command?

This is the format of my XML data:

<?xml version="1.0" encoding="utf-8"?>
<rowdata>
  <row Id="1" type="1" data="text" ... />
  <row Id="2" type="2" data="text" parent="1" ... />
  <row Id="3" type="1" data="text" ... />
  <row Id="4" type="1" data="text" ... />
  <row Id="5" type="2" data="text" parent="4" ... />
  ...

And this is my XSL sheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="iso-8859-1"/>
<xsl:strip-space elements="*" />
<xsl:template match="/rowdata">
  <xsl:for-each select="row">
    <xsl:if test="@Id = 10000">
      <xsl:value-of select="@data"/><xsl:text>&#xa;</xsl:text>
    </xsl:if>
  </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

Facts:

  1. I cannot change the XML data
  2. I can change the XSL sheet
  3. There are many rows in the XML data
  4. The for-each selector can match only one row

Problem:

  1. This command: xsltproc input.xsl input.xml is very slow. A single run takes about 10 seconds, and many runs are needed.

Already tried:

  1. Researched whether xsltproc can be made faster (multi-threaded run): it cannot.
  2. Researched whether the hardware is a bottleneck: it is not (NVMe storage, very fast 16-thread CPU). At first I thought reading a 1GB file would take a long time. It does not; only the xsltproc processing takes time.

Three questions:

  1. Does this XSLT stylesheet look optimized?
  2. Is there a way to "terminate the search (i.e. cancel further read) when the record is found"?
  3. How can I dramatically increase the speed of the command above?

Upvotes: 0

Views: 996

Answers (2)

Michael Kay

Reputation: 163458

What are you including in your 10 seconds? Does this include compiling the stylesheet and/or parsing/loading the source document, or is it purely the XSLT execution time?

I would expect that building an in-memory tree representation of your 900 MB input file is what takes most of the time (10 seconds would be pretty fast for that operation). If you need to run the stylesheet many times, the best way to improve performance is to build the source tree only once and re-use it. But you then won't be able to run directly from the command line.

In principle you can speed up this kind of stylesheet by using keys:

<xsl:key name="k" match="row" use="@Id"/>
<xsl:template match="/rowdata">
  <xsl:value-of select="key('k', 10000)/@data"/>
</xsl:template>

However, that's only going to work if you can ensure that the key index is only built once, and is then used repeatedly. At this stage I can't tell you how this might work in xsltproc, because it's all getting processor-specific.
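
One way to make sure the index is built once and then reused, while staying on the command line, is to do all the lookups in a single run. The following is a minimal sketch, not a tested recipe for xsltproc. It assumes the wanted Ids can be collected up front in a small helper file, here called ids.xml (a hypothetical name and layout), stored next to the stylesheet. The large document is then parsed once, and every lookup can share the same key index.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Assumed helper file ids.xml (hypothetical), next to this stylesheet:
     <ids><id>10000</id><id>20000</id></ids> -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="iso-8859-1"/>
  <xsl:strip-space elements="*"/>

  <!-- Index every row by its Id attribute. -->
  <xsl:key name="k" match="row" use="@Id"/>

  <!-- Remember the root of the main input: key() only searches the
       document that contains the current context node. -->
  <xsl:variable name="main" select="/"/>

  <xsl:template match="/">
    <!-- Loop over the wanted Ids listed in the helper file. -->
    <xsl:for-each select="document('ids.xml')/ids/id">
      <xsl:variable name="wanted" select="normalize-space(.)"/>
      <!-- Switch the context back to the main document for the lookup. -->
      <xsl:for-each select="$main">
        <xsl:value-of select="key('k', $wanted)/@data"/>
        <xsl:text>&#xa;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

It would be run the same way as before, for example xsltproc lookup.xsl input.xml (lookup.xsl is a placeholder name). Whether this pays off depends on how many lookups can share one run, since the parse and index cost is still paid once per run.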

You can terminate the search after the first hit simply by adding the predicate [1]. But you're looking for bigger gains than that.
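
Applied to the stylesheet in the question, that predicate could look like the sketch below. It assumes, as stated in the question, that at most one row matches. Whether the processor actually stops scanning after the first hit is implementation-dependent.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="iso-8859-1"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/rowdata">
    <!-- [1] keeps only the first matching row, so the processor
         is free to stop searching once it has found one. -->
    <xsl:value-of select="row[@Id = 10000][1]/@data"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```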

Upvotes: 1

michael.hor257k

Reputation: 117073

Assuming there can be only one row where Id is 10000, you could simply do:

<xsl:template match="/rowdata">
    <xsl:value-of select="row[@Id=10000]/@data"/>
</xsl:template>

I don't know if this will "dramatically increase the speed of the command".
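
Since many runs are needed, the Id could also be passed in from the command line instead of being hard-coded. A minimal sketch, assuming a stylesheet parameter named id (a name chosen here for illustration) that defaults to 10000:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="iso-8859-1"/>

  <!-- Hypothetical parameter; override it with xsltproc's stringparam option. -->
  <xsl:param name="id" select="'10000'"/>

  <xsl:template match="/rowdata">
    <!-- Select the data attribute of the row whose Id equals the parameter. -->
    <xsl:value-of select="row[@Id = $id]/@data"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

A run would then look like xsltproc --stringparam id 10000 lookup.xsl input.xml, where lookup.xsl is a placeholder file name. This does not change the cost of each run; it only avoids editing the stylesheet for every lookup.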

Upvotes: 0
