Reputation: 119
I regularly have to transform a large number of XML files (min. 100K), all sitting in one folder each time (basically, the unzipped input dataset), and I'd like to learn how to do that as efficiently as possible. My technological stack consists of XSLT stylesheets and the Saxon XSLT Java library, called from Bash scripts, running on an Ubuntu server with 8 cores, a RAID of SSDs and 64 GB of RAM. Keep in mind that I handle XSLT nicely, but I'm still in the process of learning Bash and how to distribute the load properly for such tasks (and Java is almost just a word to me at this point too).
I previously created a post about this issue, as my approach seemed very inefficient and actually needed help just to run properly (see this SO post). Many comments later, it made more sense to present the issue differently, hence this new post. I was offered several solutions, one of which currently works much better than mine, but it could still be more elegant and efficient.
Now, I'm running this:
printf -- '-s:%s\0' input/*.xml | xargs -P 600 -n 1 -0 java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl
I set 600 processes based on some earlier tests; going higher just throws memory errors from Java. But it only uses between 30 and 40 GB of RAM now (all 8 cores are at 100%, though).
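For reference, a variation I could try (just a sketch, untested at this scale; the output/ directory name is arbitrary) would emit a matching -o: argument for every -s: argument, so that each JVM writes its result to its own file instead of to standard output, with -P remaining a tuning knob:

mkdir -p output
for f in input/*.xml; do
  # one -s:/-o: pair per file, null-separated so unusual file names survive
  printf -- '-s:%s\0-o:output/%s\0' "$f" "${f##*/}"
done | xargs -0 -n 2 -P 8 java -jar saxon9he.jar -xsl:some-xslt-sheet.xsl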
To put it in a nutshell, here is all the advice/approaches I have gathered so far:
- the collection() function to parse the XML files
- libxml/libxslt (isn't it only for XSLT 1.0?)
- xmlsh
I can handle solution #2, and it should directly make it possible to control the loop and load the JVM only once; #1 seems clumsier, and I still need to improve in Bash (load distribution & performance, handling relative/absolute paths); #3, #4 and #5 are totally new to me, and I may need more explanation to see how to tackle them.
Any input would be greatly appreciated.
Upvotes: 0
Views: 1133
Reputation: 328
Please try GNU parallel instead of xargs; it can distribute tasks across multiple machines. I've not used it myself, though.
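Something along these lines might work (a sketch only, since I haven't used it; it assumes GNU parallel is installed, writes results to an output/ directory, and {/} is parallel's replacement string for the input file's basename):

mkdir -p output
# one Saxon JVM per file, at most 8 at a time; find -print0 avoids the shell's
# argument-length limit with 100K files
find input -name '*.xml' -print0 |
  parallel -0 -j8 java -jar saxon9he.jar -s:{} -xsl:some-xslt-sheet.xsl -o:output/{/}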
I regularly run Saxon across 100s of files.
You can look at my script:

ls -d "$@" | grep -v intermediate | grep -v "\.new" | tr '\n' '\0' |
  xargs -0 -P "$PROCESSORS" java net.coderextreme.RunSaxon --- "${OVERWRITE}" --"${STYLESHEETDIR}/X3dToJson.xslt" -json |
  sed 's/^\(.*\)$/"\1"/' |
  xargs -P "$PROCESSORS" "${NODE}" "${NODEDIR}/json2all.js"
I would especially consider using the -L # argument to xargs, so you can batch files into several calls to Java, but I don't like it myself. Your experience may be different; I only have 100 or so files.
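A rough, untested sketch of that batching idea, using -n instead of -L (with null-separated input, -n N just hands up to N file names to each java start-up, which is the same spirit; 50 is an arbitrary batch size, and the later sed/node stages are left out for brevity):

ls -d "$@" | grep -v intermediate | grep -v "\.new" | tr '\n' '\0' |
  xargs -0 -n 50 -P "$PROCESSORS" java net.coderextreme.RunSaxon --- "${OVERWRITE}" --"${STYLESHEETDIR}/X3dToJson.xslt" -json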
Here’s my Java code that calls Saxon:
https://github.com/coderextreme/X3DJSONLD/blob/master/src/main/java/net/coderextreme/RunSaxon.java
Use with care!
Upvotes: 0
Reputation: 2885
Try using the xsltproc command-line tool from libxslt. It can take multiple XML files as arguments. To call it like that, you'll need to create an output directory first. Try calling it like this:
mkdir output
xsltproc -o output/ some-xslt-sheet.xsl input/*.xml
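If the folder really holds 100K files, the input/*.xml glob may exceed the system's argument-length limit; a find/xargs variant (just a sketch along the same lines, with an arbitrary batch size of 500 files per xsltproc call) would sidestep that:

mkdir -p output
find input -name '*.xml' -print0 |
  xargs -0 -n 500 xsltproc -o output/ some-xslt-sheet.xsl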
Upvotes: 0
Reputation: 163262
"the most efficient way possible" is asking a lot, and is not usually a reasonable objective. I doubt, for example, that you would be prepared to put in 6 months' effort to improve the efficiency of the process by 3%. What you are looking for is a way of doing it that meets performance targets and can be implemented with minimum effort. And "efficiency" itself begs questions about what your metrics are.
I'm pretty confident that the design I have suggested, with a single transformation processing all the files using collection() and xsl:result-document (which are both parallelized in Saxon-EE), is capable of giving good results, and it is likely to be a lot less work than the only other approach I would consider, which is to write a Java application to hold the "control logic". Although, if you're good at writing multi-threaded Java applications, you can probably get that to go a bit faster by taking advantage of your knowledge of the workload.
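For concreteness, the invocation could look roughly like this (a sketch only: the Saxon-EE jar name, stylesheet name, initial template name and parameter names are all placeholders; the stylesheet itself would start from the named template, read the inputs with collection('...?select=*.xml') and write each result with xsl:result-document):

java -cp saxon9ee.jar net.sf.saxon.Transform -it:main -xsl:batch-transform.xsl \
  indir=file:///path/to/input outdir=file:///path/to/output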
Upvotes: 1