Reputation: 6097
I am writing a Python program that needs to clean many small strings using an external Unix program that works as a filter. Currently, I create a new subprocess for each string I want to clean:
import subprocess

def cleanstring(s):
    proc = subprocess.Popen(['/bin/filter', '-n'],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate(s)
    assert not err
    return out
Obviously, this approach is grossly inefficient. What would be an efficient way to start the filter subprocess and communicate with it via stdin/stdout for as long as needed?
I've been looking into using Python queues to implement this, but they may be overkill here. The code will be called from a Django view on a non-threaded web server, so only a single thread will be calling it, multiple times.
Thanks!
Upvotes: 3
Views: 295
Reputation: 2152
If you haven't measured it, then it's not a performance problem, much less "grossly inefficient".
That said, you can communicate with a subprocess like this:
import subprocess

p = subprocess.Popen(['bc'], shell=False,
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
for i in range(10):
    p.stdin.write('%s*2\n' % (i,))
    p.stdin.flush()  # push the line out so bc can answer before we block
    res = p.stdout.readline()
    if res:
        print "vtrip says %s*2 is %s" % (i, res.strip())
This prints the doubles of 0-9, as returned by the same bc process. It should be easy to adapt to detex (the main thing is to handle flushing correctly, so that one end doesn't get stuck waiting for the other).
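For instance, the same pattern applied to the cleanstring() function from the question might look like the sketch below. It assumes /bin/filter -n is line-oriented, i.e. it writes exactly one cleaned output line per input line and flushes it promptly; strings containing newlines would need escaping first.

import subprocess

# Start the filter once, at import time, and reuse it for every call.
# Assumption: /bin/filter -n answers each input line with one output line.
_filter = subprocess.Popen(['/bin/filter', '-n'],
                           stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE)

def cleanstring(s):
    # One request per line: send the string, flush, read one reply line.
    _filter.stdin.write(s.rstrip('\n') + '\n')
    _filter.stdin.flush()
    return _filter.stdout.readline().rstrip('\n')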
That's the communication part. As for keeping it long-running inside Django, that might not be a good idea. Queues might indeed be too much.
And task queues like Celery et al. are for tasks that are handled independently, not for a single long-running service that handles each one.
Maybe run a small Python daemon on the side, keeping the filter process open and handling requests from Django for it? Are we talking about heavy load, or something internal for, say, 100 users per day? You might not need much synchronisation beyond some crude locking.
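A minimal sketch of such a daemon, assuming a Unix domain socket at a made-up path and the same one-line-per-request protocol as above:

import os
import SocketServer
import subprocess

SOCKET_PATH = '/tmp/clean-filter.sock'  # hypothetical location

# One long-lived filter process, shared by the whole daemon.
filter_proc = subprocess.Popen(['/bin/filter', '-n'],
                               stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE)

class CleanHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        # One line in from Django, one cleaned line back out.
        line = self.rfile.readline()
        if line:
            filter_proc.stdin.write(line)
            filter_proc.stdin.flush()
            self.wfile.write(filter_proc.stdout.readline())

if os.path.exists(SOCKET_PATH):
    os.unlink(SOCKET_PATH)  # clear a stale socket from a previous run
# A single-threaded server handles one request at a time, so requests are
# serialized on the pipe -- which is the crude locking mentioned above.
SocketServer.UnixStreamServer(SOCKET_PATH, CleanHandler).serve_forever()

The Django view would then just open a connection to the socket, write the string followed by a newline, and read one line back.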
Upvotes: 2
Reputation: 76725
I think your current code is the best solution. Under Linux, starting up a process isn't really that expensive, and you have neatly encapsulated the problem. And you are directly starting the filter program, so you don't have the overhead of starting up a shell to run it.
Also, I am rather worried about buffering. Suppose you do get the filter program running in the background, reading and writing named pipes or whatever. How will you be sure that each string you push through comes out immediately? How will you flush the pipeline to synchronise the output with the input?
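To make that concern concrete: with a plain pipe you can at least turn a filter that buffers its output into a visible timeout rather than a silent hang, e.g. by polling with select() before reading. A sketch; the five-second timeout is an arbitrary choice:

import select
import subprocess

proc = subprocess.Popen(['/bin/filter', '-n'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
proc.stdin.write('some string\n')
proc.stdin.flush()
# Wait up to five seconds for the filter to produce a reply.
ready, _, _ = select.select([proc.stdout], [], [], 5.0)
if ready:
    print proc.stdout.readline().strip()
else:
    print "no reply -- the filter is probably buffering its output"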
Have you measured the load on your Django server and found this to be a problem? If you have measured the performance, please share the numbers. I'd be surprised if you actually have a problem.
Upvotes: 1