Reputation: 349
I want to transform XML files using XSLT 2.0 in a huge directory tree with many levels. There are more than 1 million files, each 4 to 10 kB. After a while I always get java.lang.OutOfMemoryError: Java heap space.
My command is: java -Xmx3072M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M ...
Adding more memory via -Xmx is not a good solution.
Here is my code:
public void pushDocuments(File dir) {
    for (File file : dir.listFiles()) {
        if (file.isDirectory()) {
            pushDocuments(file);
        } else {
            indexFiles.index(file);
        }
    }
}
public void index(File file) {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try {
        xslTransformer.xslTransform(outputStream, file);
        outputStream.flush();
        outputStream.close();
    } catch (IOException e) {
        System.err.println(e.toString());
    }
}
The XSLT transform uses net.sf.saxon.s9api:
public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile) {
    try {
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();
        out.close();
    } catch (SaxonApiException e) {
        System.err.println(e.toString());
    }
}
Upvotes: 5
Views: 4530
Reputation: 3035
I had a similar problem that came from the javax.xml.transform package, which used a ThreadLocalMap to cache the XML chunks read during the XSLT processing. I had to move the XSLT into its own thread so that the ThreadLocalMap was cleared when that thread died - this freed the memory. See here: https://www.ahoi-it.de/ahoi/news/java-xslt-memory-leak/1446
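A rough sketch of that workaround, assuming the per-file indexing call from the question (indexFiles.index); the wrapper thread itself is my addition, not from the linked article:

// Run each transform in a short-lived thread so any ThreadLocal caches
// die with the thread instead of accumulating in the main thread.
Thread worker = new Thread(() -> indexFiles.index(file));
worker.start();
try {
    worker.join(); // wait for this file's transform to finish
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
// Once 'worker' is unreachable, its ThreadLocalMap becomes eligible for GC.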
Upvotes: 0
Reputation: 163458
My usual recommendation with the Saxon s9api interface is to reuse the XsltExecutable object, but to create a new XsltTransformer for each transformation. The XsltTransformer caches documents you have read in case they are needed again, which is not what you want in this case.
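A minimal sketch of that pattern (the class wrapper and the stylesheet constructor parameter are mine for illustration; the transform body follows the question's xslTransform):

import java.io.ByteArrayOutputStream;
import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;

public class XslTransformer {
    private final Processor proc = new Processor(false);
    private final XsltExecutable executable;

    public XslTransformer(File stylesheet) throws SaxonApiException {
        // Compile the stylesheet once; the XsltExecutable can be reused.
        executable = proc.newXsltCompiler().compile(new StreamSource(stylesheet));
    }

    public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile)
            throws SaxonApiException {
        // Load a fresh XsltTransformer per source document so its cached
        // documents are discarded once the transformer becomes unreachable.
        XsltTransformer transformer = executable.load();
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();
    }
}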
As an alternative, you could call xsltTransformer.getUnderlyingController().clearDocumentPool()
after each transformation.
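If you prefer to keep a single long-lived XsltTransformer, a sketch of the question's xslTransform with that one call added (only the clearDocumentPool line is new):

public void xslTransform(ByteArrayOutputStream outputStream, File xmlFile) {
    try {
        XdmNode source = proc.newDocumentBuilder().build(new StreamSource(xmlFile));
        Serializer out = proc.newSerializer();
        out.setOutputStream(outputStream);
        transformer.setInitialContextNode(source);
        transformer.setDestination(out);
        transformer.transform();
        out.close();
        // Release the documents Saxon has pooled for this run so they can be GC'd.
        transformer.getUnderlyingController().clearDocumentPool();
    } catch (SaxonApiException e) {
        System.err.println(e.toString());
    }
}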
(Please note, you can ask Saxon questions at saxonica.plan.io, which gives a good chance we [Saxonica] will notice them and answer them. You can also ask them here and tag them "saxon", which means we'll probably respond to the question at some point, though not always immediately. If you ask on StackOverflow with no product-specific tags, it's entirely hit-and-miss whether anyone will notice the question.)
Upvotes: 5
Reputation: 26094
Try this one:
String[] files = dir.list();
for (String fileName : files) {
    File file = new File(dir, fileName);
    if (file.isDirectory()) {
        pushDocuments(file);
    } else {
        indexFiles.index(file);
    }
}
Upvotes: 0
Reputation: 533670
I would check that you don't have a memory leak. The number of files shouldn't matter, as you are only processing one at a time; as long as you can process the largest file, you should be able to process them all.
I suggest you run jstat -gc {pid} 10s
while the program is running to look for a memory leak. What you should look at is the memory used after a Full GC; if this keeps increasing, use the VisualVM memory profiler to work out why, or use jmap -histo:live {pid} | head -20
for a hint.
If the memory is not increasing, you have a single file which is triggering the out-of-memory error. This is because either (a) the file is much bigger than the others or uses much more memory, or (b) it triggers a bug in the library.
Upvotes: 1