Reputation: 279
I have a bash command that takes a directory full of XML files, runs them through XSLT into CSVs, and combines all of the transforms into a single file. I've been attempting to use parallel, but the CPU usage never goes above 100% for this command. I cannot use xargs for this because the output gets interspersed.
This takes ~30 seconds, but again, the output is interspersed:
find /path/to/xml -type f -iname '*.xml' -print0 | xargs -0 -P8 xsltproc transform.xsl > out.txt
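The interleaving is easy to reproduce without xsltproc at all. A minimal sketch (toy jobs standing in for the real transforms): each job writes two lines, and with -P4 the lines from different jobs can end up mixed together on the shared stdout.

```shell
# Each of the 4 jobs prints a "start" and an "end" line; with -P4 these
# pairs may interleave across jobs, since all jobs share one stdout.
seq 4 | xargs -n1 -P4 sh -c 'echo "start $0"; echo "end $0"'
```

All 8 lines always arrive, but not necessarily with each job's pair kept together.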
This takes ~90 seconds. Single Core.
find /path/to/xml -type f -iname '*.xml' -print0 | xargs -0 xsltproc transform.xsl > out.txt
This also takes ~90 seconds. As slow as single-core, and the CPU usage shown by top never goes above 100%.
find /path/to/xml -type f -iname '*.xml' -print0 | parallel -0 xsltproc transform.xsl > out.txt
This seems so dead simple, I don't know what I'm missing. Could anyone offer a suggestion?
Upvotes: 1
Views: 782
Reputation: 33685
GNU Parallel has an overhead per job on the order of 5 ms. So if your jobs are short-lived, this overhead will be the limiting factor.
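You can get a rough feel for that overhead yourself. A sketch (assumes GNU parallel is installed; job count chosen arbitrarily): run a batch of no-op jobs through each tool and compare wall time. At roughly 5 ms of bookkeeping per job, 200 jobs cost about a second in parallel before any real work happens.

```shell
# 200 do-nothing jobs: parallel pays its per-job setup cost each time,
# while a single xargs batch (or xargs -n1) is far cheaper per job.
time sh -c 'seq 200 | parallel true'
time sh -c 'seq 200 | xargs -n1 true'
```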
xsltproc can take several files as arguments, so this may help:
find /path/to/xml -type f -iname '*.xml' -print0 |
parallel -X -0 xsltproc transform.xsl > out.txt
Edit
If this does the right thing:
find /path/to/xml -type f -iname '*.xml' -print0 |
xargs -0 -P8 xsltproc transform.xsl > out.txt
(except for the mixed output), then the -X solution must also do the right thing. The xargs -P8 solution will put many filenames after transform.xsl. The same is the case for -X. Are you sure the output from xargs -P8 is the full (though mixed) output?
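One way to check what -X will actually run (assumes GNU parallel): --dry-run prints the composed command lines instead of executing them, so you can see how many filenames get packed after transform.xsl.

```shell
# Print the commands parallel -X would run, without running xsltproc.
# (paths as in the question)
find /path/to/xml -type f -iname '*.xml' -print0 |
  parallel --dry-run -X -0 xsltproc transform.xsl
```

Each printed line should look like xsltproc transform.xsl followed by a batch of filenames, mirroring what xargs builds.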
If xsltproc only works reliably with a single file name, try this:
find /path/to/xml -type f -iname '*.xml' |
parallel --pipe -N100 --round-robin parallel xsltproc transform.xsl > out.txt
This spawns a parallel per CPU core. So you should now either see 100% CPU usage on all CPUs or 100% disk I/O. If the files are cached, then you should see 100% CPU usage - a lot of it from GNU Parallel, though.
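To see what the outer parallel is doing with --pipe, a small sketch (assumes GNU parallel; numbers chosen for illustration): it cuts the incoming line stream into records of -N lines and --round-robin deals those records out to the workers, so every worker receives whole lines, never a fragment of a filename.

```shell
# Split 9 input lines into 3-line records and round-robin them to the
# workers; each worker reports how many lines it received in total.
seq 9 | parallel --pipe -N3 --round-robin 'wc -l'
```

The per-worker counts vary with the number of job slots, but they always sum to the 9 input lines.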
Upvotes: 1