Ricardo

Reputation: 43

How to increase the speed of Saxon evaluation in C#?

I'm currently using Saxon to process XQuery in our .NET application. We're working with very large XML files (~2GB). When I run the XQuery against one of these files using the Saxon binary directly, the evaluation completes in around 2 minutes, but when I run the same evaluation from my C# application the elapsed time increases to around 10 minutes, and I haven't yet been able to identify what I'm doing wrong.

This is what I'm doing when I run the XQuery using the Saxon binary from the command line:

Query.exe -config:config.xml -q:XQueryTest.txt

These are the contents of the config.xml:

<configuration xmlns="http://saxon.sf.net/ns/configuration" edition="HE">
  <xquery defaultElementNamespace="http://www.irs.gov/efile"/>
</configuration>

XQueryTest.txt contains the XQuery we want to process. When running the XQuery from the command line, we modify it to name the file it runs against, using the doc() function. Here is a sample line:

for 
    $ReturnData at $currentReturnDataPos in if(exists(doc("2GB.XML")/Return/ReturnData)) then doc("2GB.XML")/Return/ReturnData else element{'ReturnData'} {''}

As mentioned above, running this command takes about 2 minutes to complete.

This is what I'm doing in my .NET application to perform the same evaluation:

Processor processor = new Processor();
DocumentBuilder documentBuilder = processor.NewDocumentBuilder();
documentBuilder.IsLineNumbering = true;
documentBuilder.WhitespacePolicy = WhitespacePolicy.PreserveAll;
XQueryCompiler compiler = processor.NewXQueryCompiler();

string query = BuildXqueryString();

if (!String.IsNullOrEmpty(query))
{
    XQueryExecutable executable = compiler.Compile(query);
    XQueryEvaluator evaluator = executable.Load();

    using (XmlReader myReader = XmlReader.Create(@"C:\Users\Administrator\Desktop\2GB.xml"))
    {
        evaluator.ContextItem = documentBuilder.Build(myReader);
    }

    var evaluations = evaluator.Evaluate();
}

The issue is in this line: evaluator.ContextItem = documentBuilder.Build(myReader), which is not even the evaluation, just the loading of the file. This line takes far too long to execute, and I need to know whether that is expected or whether there is a way to speed it up. I have tried all the overloads of the Build() method, and they all take much longer than the 2 minutes the whole run takes from the command line.

Using Saxon's streaming capability to read the file in parts is not an option, because the XQueries we generate can combine information from any part of the XML.

Upvotes: 1

Views: 618

Answers (2)

ond1

Reputation: 771

The slow performance is caused by using the .NET XmlReader to do the parsing. Passing parse events from the .NET XML parser to the Saxon receiver is much slower than letting Saxon use the Xerces parser (accessed through JAXP) that is supplied with it.

To force the JAXP parser, supply a URI instead of an XmlReader; the following should work:

evaluator.ContextItem = documentBuilder.Build(new Uri("file:///C:/Users/Administrator/Desktop/2GB.xml"));

Upvotes: 0

Michael Kay

Reputation: 163458

We have seen a similar 5-to-1 ratio between Saxon on the Java platform and Saxon on the .NET platform in some cases, and we haven't got to the bottom of why it happens despite extensive investigation. Part of the reason is that it seems to be inconsistent. When we first shipped Saxon on .NET using the IKVMC cross-compiler, the ratio was much better, with only about a 25% overhead on .NET, but there seem to have been a number of changes in technology since then: Java VMs have got faster, IKVMC has switched from using the GNU Classpath library to OpenJDK, and .NET itself hasn't stood still.

It's new to me, though, that the same code should run much faster from the .NET command line than it runs from the .NET API.

The big difference here is that when you run from the command line, Saxon builds the document using the Apache Xerces parser (converted to .NET code using IKVMC), whereas when you use DocumentBuilder.Build() in the way shown, you are using Microsoft's XmlReader.

I would expect the document building to run fastest when you supply a (file system) URI, but I can't say I've measured it. It might be worth doing some experiments (perhaps with smaller files) and showing us the results. Alternatively, have you tried using the doc() function from your application, rather than building the document first?
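
A rough sketch of such an experiment, for anyone who wants to try it: it times the three ways of loading the document mentioned in this thread. The class name, the Time helper and the count(...) query are illustrative assumptions, not part of the original code, and the path should be pointed at a smaller sample file first. It has not been measured here, so it makes no claim about which variant wins.

```csharp
using System;
using System.Diagnostics;
using System.Xml;
using Saxon.Api;

class BuildTiming
{
    // Assumed path, taken from the question; substitute a smaller sample file.
    const string SamplePath = @"C:\Users\Administrator\Desktop\2GB.xml";

    static void Main()
    {
        Processor processor = new Processor();

        // 1. Build from a .NET XmlReader, as in the question.
        Time("Build(XmlReader)", () =>
        {
            DocumentBuilder builder = processor.NewDocumentBuilder();
            using (XmlReader reader = XmlReader.Create(SamplePath))
            {
                builder.Build(reader);
            }
        });

        // 2. Build from a file URI, so that Saxon chooses the parser itself.
        Time("Build(Uri)", () =>
        {
            DocumentBuilder builder = processor.NewDocumentBuilder();
            builder.Build(new Uri(SamplePath));
        });

        // 3. No pre-built document: the query loads the file through doc().
        //    The count(...) query is only a placeholder that forces the load.
        Time("doc()", () =>
        {
            XQueryCompiler compiler = processor.NewXQueryCompiler();
            string query = "count(doc('" + new Uri(SamplePath).AbsoluteUri + "')/*)";
            XQueryEvaluator evaluator = compiler.Compile(query).Load();
            evaluator.Evaluate();
        });
    }

    // Hypothetical helper: runs the action once and prints the elapsed time.
    static void Time(string label, Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        Console.WriteLine("{0}: {1} ms", label, stopwatch.ElapsedMilliseconds);
    }
}
```

Each variant is run only once, so the first one also pays any start-up cost; reordering the calls or repeating them would give a fairer comparison.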

Upvotes: 1
