Python pipeline using GNU Parallel

Question

I'm trying to write a wrapper around GNU Parallel in Python to run a command in parallel, but I seem to be misunderstanding how GNU Parallel works, how system pipes work, or how Python's subprocess pipes work.

Essentially I am looking to use GNU Parallel to handle splitting up an input file and then running another command in parallel on multiple hosts.

I can investigate a pure-Python way to do this in the future, but it seems like it should be easy to implement using GNU Parallel.

t.py

#!/usr/bin/env python

import sys

print
print sys.stdin.read()
print

p.py

from subprocess import *
import os
from os.path import *

args = ['--block', '10', '--recstart', '">"', '--sshlogin', '3/:', '--pipe', './t.py']

infile = 'test.fa'

fh = open('test.fa','w')
fh.write('''>M02261:11:000000000-ADWJ7:1:1101:16207:1115 1:N:0:1
CAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTTTCGCTCGCAGCTACTCGGGGAATCCTTGTTGCTGAGCTCTTCCCTTT
>M02261:11:000000000-ADWJ7:1:1101:21410:1136 1:N:0:1
ATAGTAGATAGGGACATAGGGAATCTCGTTAATCCATTCATGCGCGTCACTAATTAGATGACGAGGCATTTGGCTACCTTAAGAGAGTCATAGTTACTCCCGCCGTTTACC
>M02261:11:000000000-ADWJ7:1:1101:13828:1155 1:N:0:1
GGTTTAGAGTCTCTAGTCGATAGATCAATGTAGGTAAGGGAAGTCGGCAAATTAGATCCGTAACTTCGGGATAAGGATTGGCTCTGAAGGCTGGGATGACTCGGGCTCTGGTGCCTTCGCGGGTGCTTTGCCTCAACGCGCGCCGGCCGGCTCGGGTGGTTTGCGCCGCCTGTGGTCGCGTCGGCCGCTGCAGTCATCAATAAACAGCCAATTCAGAACTGGCACGGCTGAGGGAATCCGACGGTCTAATTAAAACAAAGCATTGTGATGGACTCCGCAGGTGTTGACACAATGTGATTTT
>M02261:11:000000000-ADWJ7:1:1101:14120:1159 1:N:0:1
GAGTAGCTGCGAGCGAAAAGGGAAGAGCTCAAGGGGAGGAAAAGAAACTAACAAGGATTCCCCGAGTAGCTGCGAGCGAAAAGGGAAGCGCCCAAGGGGGGCAACAGGAACTAACAAGAATTCGCCGACTAGCTGCGACCTGAAAAGGAAAAACCCAAGGGGAGGAAAAGAAACTAACAAGGATTCCCCGAGTAGCTGCGAGCAGAAAAGGAAAAGCACAAGAGGAGGAAACGACACTAATAAGACTTCCCATACAAGCGGCGAGCAAAACAGCACGAGCCCAACGGCGAGAAAAGCAAAA
>M02261:11:000000000-ADWJ7:1:1101:8638:1172 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
''')
fh.close()

# Call 1
Popen(['parallel']+args, stdin=open(infile,'rb',0), stdout=open('output','w')).wait()

# Call 2
_cat = Popen(['cat', infile], stdout=PIPE)
Popen(['parallel']+args, stdin=_cat.stdout, stdout=open('output2','w')).wait()

# Call 3
Popen('cat '+infile+' | parallel ' + ' '.join(args), shell=True, stdout=open('output3','w')).wait()

Call 1 and Call 2 produce the same output, while Call 3 produces the output I would expect: the input file is split up, and the output contains empty lines between records.

I'm mostly curious about what differs between Calls 1 and 2 on the one hand and Call 3 on the other.

tripleee · Accepted Answer

TL;DR Don't quote ">" when shell=False.

If you use shell=True, you can use all the shell's facilities, like globbing, I/O redirection, etc. You will need to quote anything which needs to be escaped from the shell. You can pass the entire command line as a single string, and the shell will parse it.

unsafe = subprocess.Popen('echo `date` "my files" * >output', shell=True)

With shell=False, you have no "secret" side effects behind the scenes, and none of the shell's facilities are available to you. So you need to take care of globbing, redirection, etc. on the Python side. On the plus side, you save a (potentially significant) extra process, you have more control, and you don't need to (and indeed mustn't) quote things which had to be quoted when the shell was involved. In summary, this is also safer, because you can see exactly what you are doing.

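import subprocess
from glob import glob

cmd = ['echo']
cmd.append(datestamp())  # datestamp() stands in for a helper of your own that returns the date as a string
cmd.append('my files')  # notice absence of shell quotes around string
cmd.extend(glob('*'))  # expand the wildcard in Python, since no shell will do it
safer = subprocess.Popen(cmd, shell=False, stdout=open('output', 'w+'))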

(This still differs slightly, because with modern shells, echo is a builtin, whereas now, we will be executing an external utility /bin/echo or whichever executable with that name comes first in your PATH.)

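Now, returning to your examples, the problem in your args is that you are quoting a literal ">" as the record separator. When a shell is involved, an unquoted right angle bracket would invoke redirection, so to specify it as a string, it has to be escaped or quoted; the shell then removes the quotes before parallel ever sees the argument, which is why Call 3 behaves as you expect. When no shell is in the picture, there isn't anything which handles (or requires) those quotes, so in Calls 1 and 2 parallel is told to look for the three-character string ">" with the quotes included, which never occurs in your file. To pass a literal > argument, simply pass it literally, i.e. '--recstart', '>'.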
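To see the effect without involving GNU Parallel at all, here is a minimal sketch which hands the same argument to a tiny child process and reports what arrives. The names CHILD and received are made up for this illustration, and the child is just the current Python interpreter echoing its first argument.

```python
import shlex
import subprocess
import sys

# Child program that writes back exactly what it received as its first argument.
# (CHILD and received are hypothetical names used only for this illustration.)
CHILD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])"]

def received(arg):
    """Return the argument as seen by a child started without a shell."""
    return subprocess.check_output(CHILD + [arg]).decode()

# No shell: the double quotes are part of the argument itself.
quoted = received('">"')

# No shell: pass the bare character to get the bare character.
bare = received('>')

# A POSIX shell would have removed the quotes before the program ran;
# shlex.split mimics that parsing step.
shell_view = shlex.split('--recstart ">"')

print(repr(quoted), repr(bare), shell_view)
```

The first call delivers three characters to the child, the second delivers one, and the shlex.split result shows the single character a program would get after a shell has parsed the quoted form.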

With that out of the way, your call #1 definitely seems like the way to go. (Though I'm not entirely convinced that it's sane to write a Python wrapper for a shell command implemented in Perl. I suspect that juggling a bunch of parallel child processes in Python directly would not be more complicated.)
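For completeness, a corrected version of call #1 might look roughly like the following. This is only a sketch: build_parallel_cmd and run_call_1 are names invented here, the options are copied unchanged from the question apart from the quoting, and I have not verified it against your hosts.

```python
import subprocess

def build_parallel_cmd(worker="./t.py", recstart=">"):
    """Build the argument list from the question, with recstart unquoted."""
    # Hypothetical helper: same options as the question's args list,
    # but '>' is passed as a bare character because no shell is involved.
    return ["parallel", "--block", "10", "--recstart", recstart,
            "--sshlogin", "3/:", "--pipe", worker]

def run_call_1(infile, outfile):
    """Run the command with infile on stdin and outfile on stdout."""
    # Context managers close both files even if the command fails.
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        return subprocess.call(build_parallel_cmd(), stdin=src, stdout=dst)

# Building the list has no side effects, so it is safe to inspect.
cmd = build_parallel_cmd()
print(cmd)
```

With GNU Parallel installed and t.py executable, run_call_1('test.fa', 'output') would then take the place of the Popen(...).wait() line in call #1 and return the exit status.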
