wonder garance

Reputation: 349

java.lang.OutOfMemoryError while transforming XML in a huge directory

I want to transform XML files using XSLT 2.0 in a huge directory with many levels of nesting. There are more than 1 million files, each 4 to 10 kB. After a while the program always fails with java.lang.OutOfMemoryError: Java heap space.

My command is: java -Xmx3072M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M ...

Adding more memory via -Xmx is not a good solution.

Here is my code:

public void pushDocuments(File dir) {
    // Walk the tree recursively, indexing every regular file.
    for (File file : dir.listFiles()) {
        if (file.isDirectory()) {
            pushDocuments(file);
        } else {
            indexFiles.index(file);
        }
    }
}

public void index(File file) {
    // Fresh in-memory buffer per file; it becomes unreachable when this method returns.
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

    try {
        xslTransformer.xslTransform(outputStream, file);
        outputStream.flush();
        outputStream.close();
    } catch (IOException e) {
        System.err.println(e.toString());
    }
}

The XSLT transform uses net.sf.saxon.s9api:

public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile) {
    try {
        // Build a document tree for the source file, then run the shared transformer.
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();

        out.close();
    } catch (SaxonApiException e) {
        System.err.println(e.toString());
    }
}

Upvotes: 5

Views: 4530

Answers (4)

Alexander

Reputation: 3035

I had a similar problem that came from the javax.xml.transform package, which uses a ThreadLocalMap to cache the XML chunks read during the XSLT transformation. I had to move the XSLT into its own Thread, so that the ThreadLocalMap was cleared when that Thread died; this freed the memory. See here: https://www.ahoi-it.de/ahoi/news/java-xslt-memory-leak/1446. A sketch of the idea is below.
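A minimal sketch of that workaround, reusing the indexFiles and file names from the question; the point is only that the worker thread is discarded after each run, so ThreadLocal values populated during the transform become garbage-collectable:

Thread worker = new Thread(() -> indexFiles.index(file));
worker.start();
try {
    // Wait for the transform; once the thread dies, its ThreadLocalMap
    // (and any XML chunks cached in it) can be garbage-collected.
    worker.join();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

Note that spawning one thread per file has real overhead with a million files; processing a batch of files per short-lived thread would amortize it.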

Upvotes: 0

Michael Kay

Reputation: 163458

My usual recommendation with the Saxon s9api interface is to reuse the XsltExecutable object, but to create a new XsltTransformer for each transformation. The XsltTransformer caches documents you have read in case they are needed again, which is not what you want in this case.
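A minimal sketch of that pattern, adapted to the question's code; the stylesheet argument and the wrapping class are assumptions for illustration, not part of the original:

import java.io.ByteArrayOutputStream;
import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;

public class BulkTransformer {
    private final Processor proc = new Processor(false);
    private final XsltExecutable executable; // compiled once, reused for every file

    public BulkTransformer(File xslFile) throws SaxonApiException {
        executable = proc.newXsltCompiler().compile(new StreamSource(xslFile));
    }

    public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile)
            throws SaxonApiException {
        // A fresh XsltTransformer per file: its document pool is discarded
        // with it, so parsed source trees do not accumulate on the heap.
        XsltTransformer transformer = executable.load();
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();
        out.close();
    }
}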

As an alternative, you could call xsltTransformer.getUnderlyingController().clearDocumentPool() after each transformation.
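If you keep the question's single long-lived transformer instead, that call would sit right after each transformation; a two-line sketch:

transformer.transform();
// Release the source documents cached by this Controller so they can be GC'd.
transformer.getUnderlyingController().clearDocumentPool();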

(Please note, you can ask Saxon questions at saxonica.plan.io, which gives a good chance we [Saxonica] will notice them and answer them. You can also ask them here and tag them "saxon", which means we'll probably respond to the question at some point, though not always immediately. If you ask on StackOverflow with no product-specific tags, it's entirely hit-and-miss whether anyone will notice the question.)

Upvotes: 5

Prabhakaran Ramaswamy

Reputation: 26094

Try this one

String[] files = dir.list();
for (String fileName : files) {
    // Resolve against dir; otherwise the path is relative to the working directory.
    File file = new File(dir, fileName);
    if (file.isDirectory()) {
        pushDocuments(file);
    } else {
        indexFiles.index(file);
    }
}

Upvotes: 0

Peter Lawrey

Reputation: 533670

I would check that you don't have a memory leak. The number of files shouldn't matter, as you are only processing one at a time; as long as you can process the largest file, you should be able to process them all.

I suggest you run jstat -gc {pid} 10s while the program is running to look for memory leaks. What you should look for is the heap size after each Full GC: if it keeps increasing, use the VisualVM memory profiler to work out why, or use jmap -histo:live {pid} | head -20 for a hint.

If the memory is not increasing, then a single file is triggering the OutOfMemoryError, either because a) that file is much bigger than the others or needs much more memory to process, or b) it triggers a bug in the library.

Upvotes: 1
