Phyo Arkar Lwin

Reputation: 6891

Multi processing subprocess

I'm new to the subprocess module of Python; currently my implementation is not multi-processed.

    import shlex
    import subprocess

    def forcedParsing(fname):
        cmd = 'strings "%s"' % (fname)
        #print cmd
        args = shlex.split(cmd)
        try:
            sp = subprocess.Popen(args, shell=False,
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE)
            out, err = sp.communicate()
        except OSError as e:
            # Bind the exception instance; OSError.errno on the class is a bug.
            print "Error no %s  Message %s" % (e.errno, e.strerror)
            return None  # sp is unbound here, so don't fall through

        if sp.returncode == 0:
            #print "Processed %s" % fname
            return out

    res = []
    for f in file_list:
        res.append(forcedParsing(f))

my questions:

  1. Is sp.communicate() a good way to go? Should I use poll()?

     If I use poll() I need a separate process which monitors whether the process has finished, right?

  2. Should I fork at the for loop?

Upvotes: 1

Views: 2095

Answers (3)

tokland

Reputation: 67860

1) subprocess.communicate() seems the right option for what you are trying to do. And you don't need to poll the process; communicate() returns only when it has finished.

2) You mean forking to parallelize the work? Take a look at multiprocessing (Python >= 2.6). Running parallel processes using subprocess is of course possible, but it's quite a bit of work; you cannot just call communicate(), which is blocking.
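A minimal sketch of that multiprocessing approach (modern Python syntax). The worker `forced_parsing` below is a stand-in for the question's `forcedParsing`, and each worker process calls `communicate()` on its own child, so the blocking is harmless:

```python
import subprocess
from multiprocessing import Pool

def forced_parsing(fname):
    # Stand-in for the question's forcedParsing: run "strings" on one file.
    try:
        sp = subprocess.Popen(["strings", fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    except OSError:
        return None                    # e.g. "strings" not installed
    out, _err = sp.communicate()       # blocks until this child exits
    return out if sp.returncode == 0 else None

def parse_all(file_list):
    # One worker process per CPU core by default.
    with Pool() as pool:
        return pool.map(forced_parsing, file_list)
```

`pool.map()` returns results in the same order as `file_list`, so it can replace the serial loop from the question directly.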

About your code:

cmd = 'strings "%s"' % (fname)
args= shlex.split(cmd)

Why not simply?

args = ["strings", fname]

As for this ugly pattern:

res=[]
for f in file_list: res.append(forcedParsing(f))

You should use list-comprehensions whenever possible:

res = [forcedParsing(f) for f in file_list]

Upvotes: 3

Jacob Oscarson

Reputation: 6393

About question 2: forking at the for loop will mostly speed things up if the script is supposed to run on a system with multiple cores/processors. It will consume more memory, though, and will stress IO harder. There will be a sweet spot somewhere that depends on the number of files in file_list, but only benchmarking on a realistic target system can tell you where it is. If you find that number, you could add an if len(file_list) > <your number>: with optional fork()'ing [Edit: rather, as @tokland says, via multiprocessing if it's available on your Python version (2.6+)] that chooses the most efficient strategy on a per-job basis.
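One quick way to look for that sweet spot is to time the serial and parallel versions side by side. This is only a sketch: `work()` here is a hypothetical CPU-bound stand-in for the real per-file job, not the actual strings call, so the numbers only illustrate the technique:

```python
import time
from multiprocessing import Pool

def work(n):
    # Hypothetical stand-in for the per-file job: burn a little CPU.
    return sum(i * i for i in range(n))

def bench(fn):
    # Wall-clock time of a single call.
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

jobs = [200000] * 8  # pretend file_list has 8 entries

serial_time = bench(lambda: [work(n) for n in jobs])
with Pool() as pool:
    parallel_time = bench(lambda: pool.map(work, jobs))

print("serial: %.3fs  parallel: %.3fs" % (serial_time, parallel_time))
```

On a multi-core box the parallel run usually wins for CPU-bound work, but for very small job lists the process start-up cost can dominate, which is exactly the sweet-spot effect described above.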

Read about Python profiling here: http://docs.python.org/library/profile.html

If you're on Linux, you can also run time: http://linuxmanpages.com/man1/time.1.php

Upvotes: 2

Mark Byers

Reputation: 838216

There are several warnings in the subprocess documentation that advise you to use communicate() to avoid problems with processes blocking, so it would be a good idea to use that.
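The blocking problem those warnings describe is a full pipe buffer: if you read one stream manually while the child fills another, both sides can stall. communicate() drains stdout and stderr together and then waits, as in this small sketch (sys.executable is used so the child is just another Python interpreter):

```python
import subprocess
import sys

# Child writes ~100 KB to stdout, more than a typical pipe buffer holds.
sp = subprocess.Popen(
    [sys.executable, "-c", "print('x' * 100000)"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# communicate() reads both pipes to EOF, then reaps the child, so the
# child can never block on a full pipe while we wait for it.
out, err = sp.communicate()
```

Reading sp.stdout.read() and sp.stderr.read() by hand in sequence, by contrast, is exactly the pattern the documentation warns can deadlock.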

Upvotes: 1
