bluefear

Reputation: 279

Why is GNU parallel as slow as single-CPU xargs for this command?

I have a bash command that takes a directory full of XML files, transforms each one to CSV with XSLT, and combines all of the results into a single file. I've been trying to use GNU parallel for this, but CPU usage never goes above 100%. I cannot use xargs because its parallel output gets interspersed.

This takes ~30 seconds, but again, the output is interspersed:

find /path/to/xml -type f -iname '*.xml' -print0 | xargs -0 -P8 xsltproc transform.xsl > out.txt

This takes ~90 seconds on a single core:

find /path/to/xml -type f -iname '*.xml' -print0 | xargs -0 xsltproc transform.xsl > out.txt

This also takes ~90 seconds. It is as slow as the single-core run, and CPU usage in top never goes above 100%:

find /path/to/xml -type f -iname '*.xml' -print0 | parallel -0 xsltproc transform.xsl > out.txt

This seems dead simple, and I don't know what I'm missing. Could anyone offer a suggestion?

Upvotes: 1

Views: 782

Answers (1)

Ole Tange

Reputation: 33685

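GNU Parallel has an overhead per job on the order of 5 ms, which caps it at roughly 200 job starts per second. So if your jobs are short-lived, this overhead will be the limiting factor.

xsltproc can take several files as arguments, so running it with -X (which puts as many file names on each command line as will fit) may help: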

find /path/to/xml -type f -iname '*.xml' -print0 |
  parallel -X -0 xsltproc transform.xsl > out.txt

Edit

If this does the right thing:

find /path/to/xml -type f -iname '*.xml' -print0 |
  xargs -0 -P8 xsltproc transform.xsl > out.txt

(except for the mixed output), then the -X solution must also do the right thing. The xargs -P8 solution will put many filenames after transform.xsl. The same is the case for -X. Are you sure the output from xargs -P8 is the full (though mixed) output?
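
To see the batching described above without running any transforms, one option is to put echo in front of the command so that xargs prints each command line it would run. This is only a sketch: the file names a.xml, b.xml and c.xml are made up, and printf stands in for find.

```shell
# Hypothetical file names; echo prints the command line instead of running xsltproc.
batched=$(printf '%s\0' a.xml b.xml c.xml | xargs -0 echo xsltproc transform.xsl)
echo "$batched"

# With -n 1, xargs starts one command per file name instead of one batch.
single=$(printf '%s\0' a.xml b.xml c.xml | xargs -0 -n 1 echo xsltproc transform.xsl)
echo "$single"
```

The first call prints a single line with all three file names after transform.xsl; the second prints three lines, one per file. If xsltproc gives different results in those two modes, the batched output is not the full output.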

If xsltproc only works reliably with a single file name, try this:

find /path/to/xml -type f -iname '*.xml' |
  parallel --pipe -N100 --round-robin parallel xsltproc transform.xsl > out.txt

This spawns one GNU Parallel instance per CPU core and feeds the file names to them in blocks of 100. So you should now see either 100% CPU usage on all CPUs or 100% disk I/O. If the files are cached, you should see 100% CPU usage, though a lot of it will come from GNU Parallel itself.

Upvotes: 1
