user3638629

Reputation: 145

Multiprocessing thousands of files with external command

I want to launch an external command from Python for about 8000 files. Every file is processed independently of the others. The only constraint is that execution must not continue until all files have been processed. I have 4 physical cores, each with 2 logical cores (multiprocessing.cpu_count() returns 8). My idea was to use a pool of four parallel, independent processes running on 4 of the 8 cores. That way my machine should remain usable in the meantime.

Here's what I've been doing:

import multiprocessing
import subprocess
import os
from multiprocessing.pool import ThreadPool


def process_files(input_dir, output_dir, option):
    pool = ThreadPool(multiprocessing.cpu_count()/2)
    for filename in os.listdir(input_dir):  # about 8000 files
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        pool.apply_async(subprocess.Popen, (cmd,))
    pool.close()
    pool.join()


def main():
    process_files('dir1', 'dir2', 'mol:H')
    do_some_stuff('dir2')
    process_files('dir2', 'dir3', 'mol:a')
    do_more_stuff('dir3')

Sequential processing takes 120 s for a batch of 100 files. The multiprocessing version outlined above (the process_files function) takes only 20 s for the same batch. However, when I run process_files on the whole set of 8000 files, my PC hangs and has not unfrozen even after an hour.

My questions are:

1) I thought ThreadPool was supposed to initialize a pool of processes (multiprocessing.cpu_count()/2 of them here, to be exact). However, my computer hanging on 8000 files but not on 100 suggests that maybe the size of the pool is not taken into account. Either that, or I'm doing something wrong. Could you explain?

2) Is this the right way to launch independent processes under Python when each of them must launch an external command, in such a way that the processing does not take up all of the machine's resources?

Upvotes: 3

Views: 1712

Answers (2)

larsks

Reputation: 312500

I think your basic problem is the use of subprocess.Popen. That call does not wait for the command to complete before returning. Since the call returns immediately (even though the command is still running), the task is finished as far as your ThreadPool is concerned and it can spawn another...which means that you end up spawning 8000 or so processes.

You would probably have better luck using subprocess.check_call:

Run command with arguments.  Wait for command to complete.  If
the exit code was zero then return, otherwise raise
CalledProcessError.  The CalledProcessError object will have the
return code in the returncode attribute.

So:

def process_files(input_dir, output_dir, option):
    pool = ThreadPool(multiprocessing.cpu_count()/2)
    for filename in os.listdir(input_dir):  # about 8000 files
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        pool.apply_async(subprocess.check_call, (cmd,))
    pool.close()
    pool.join()

If you really don't care about the exit code, then you may want subprocess.call, which will not raise an exception in the event of a non-zero exit code from the process.
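For example, here is a minimal sketch of that variant, keeping the structure of the code above and only swapping in subprocess.call (and using // so the pool size stays an integer on Python 3):

import multiprocessing
import os
import subprocess
from multiprocessing.pool import ThreadPool


def process_files(input_dir, output_dir, option):
    # Use half the logical cores, as in the question.
    pool = ThreadPool(multiprocessing.cpu_count() // 2)
    for filename in os.listdir(input_dir):
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        # subprocess.call blocks until the command exits and returns its
        # exit code, so each worker runs only one command at a time.
        pool.apply_async(subprocess.call, (cmd,))
    pool.close()
    pool.join()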

Upvotes: 1

Roland Smith

Reputation: 43533

If you are using Python 3, I would consider using the map method of concurrent.futures.ThreadPoolExecutor.
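A minimal sketch of that approach, assuming the same molconvert command as in the question (the convert_all and run_one names are made up for illustration):

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def convert_all(input_dir, output_dir, option, workers=4):
    filenames = os.listdir(input_dir)

    def run_one(filename):
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        # check_call blocks until molconvert exits, so at most `workers`
        # commands run at the same time.
        return subprocess.check_call(['molconvert', option, f_in, '-o', f_out])

    with ThreadPoolExecutor(max_workers=workers) as ex:
        # list() forces the iterator, so any CalledProcessError surfaces here.
        list(ex.map(run_one, filenames))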

Alternatively, you can manage a list of subprocesses yourself.

The following example defines a function to start ffmpeg to convert a video file to Theora/Vorbis format. It returns the Popen object of the started subprocess.

def startencoder(iname, oname, offs=None):
    args = ['ffmpeg']
    if offs is not None and offs > 0:
        args += ['-ss', str(offs)]
    args += ['-i', iname, '-c:v', 'libtheora', '-q:v', '6', '-c:a',
            'libvorbis', '-q:a', '3', '-sn', oname]
    with open(os.devnull, 'w') as bb:
        p = subprocess.Popen(args, stdout=bb, stderr=bb)
    return p

In the main program, a list of Popen objects representing running subprocesses is maintained like this.

outbase = tempname()
ogvlist = []
procs = []
maxprocs = cpu_count()
for n, ifile in enumerate(argv):
    # Wait while the list of processes is full.
    while len(procs) == maxprocs:
        manageprocs(procs)
    # Add a new process
    ogvname = outbase + '-{:03d}.ogv'.format(n + 1)
    procs.append(startencoder(ifile, ogvname, offset))
    ogvlist.append(ogvname)
# All jobs have been submitted, wait for them to finish.
while len(procs) > 0:
    manageprocs(procs)

So a new process is started only when there are fewer running subprocesses than cores. Code that is used multiple times is factored out into the manageprocs function.

def manageprocs(proclist):
    # Drop subprocesses that have finished; poll() returns None while
    # a process is still running, and its exit code once it has ended.
    for pr in proclist:
        if pr.poll() is not None:
            proclist.remove(pr)
    sleep(0.5)

The call to sleep is used to prevent the program from spinning in the loop.
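For reference, here is a minimal sketch of the same pattern applied to the molconvert command from the question, with the number of subprocesses capped at half the logical cores as in the original post (the pruning is inlined here rather than factored into manageprocs):

import os
import subprocess
from multiprocessing import cpu_count
from time import sleep


def process_files(input_dir, output_dir, option, maxprocs=cpu_count() // 2):
    procs = []  # Popen objects of the subprocesses still running
    for filename in os.listdir(input_dir):
        # Wait until a slot is free; poll() returns None while a
        # subprocess is still running.
        while len(procs) >= maxprocs:
            procs[:] = [p for p in procs if p.poll() is None]
            sleep(0.5)
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        procs.append(subprocess.Popen(['molconvert', option, f_in, '-o', f_out]))
    # All commands have been started; wait for the remaining ones to finish.
    while procs:
        procs[:] = [p for p in procs if p.poll() is None]
        sleep(0.5)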

Upvotes: 1
