Francesco Costa

Reputation: 1

How to call a Linux command-line program in parallel with Python

I have a command-line program which runs on a single core. It takes an input file, does some calculations, and produces several files which I need to parse to store the output. I have to call the program several times with different input files, so to speed things up I was thinking parallelization would be useful. Until now I have performed this task by calling every run separately within a loop, using the subprocess module.
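Roughly, my current sequential version looks like this (the program name, file names, and the parse step are placeholders):

import subprocess
from pathlib import Path

input_files = ['case1.inp', 'case2.inp']  # placeholder input files
results = []

for i, input_file in enumerate(input_files):
    run_dir = Path(f'run_{i}')       # a fresh working folder per run
    run_dir.mkdir(exist_ok=True)
    # 'my_program' stands in for the real single-core executable;
    # its output files land in the working directory given by cwd
    subprocess.call(['my_program', input_file], cwd=run_dir)
    # ...parse the files in run_dir here and append the data to results...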

I wrote a script which creates a new working folder on every run, then calls the program with its output directed to that folder, and returns some data which I need to store. My question is: how can I adapt the following code, found here, so that it always runs my script using the indicated number of CPUs and stores the output? Note that each run has a unique running time. Here is the mentioned code:

import subprocess
import multiprocessing as mp
from tqdm import tqdm

NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)

def work(sec_sleep):
    command = ['python', 'worker.py', sec_sleep]
    subprocess.call(command)


def update_progress_bar(_):
    progress_bar.update()


if __name__ == '__main__':
    pool = mp.Pool(NUMBER_OF_TASKS)

    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(work, (seconds,), callback=update_progress_bar)

    pool.close()
    pool.join()

Upvotes: 0

Views: 346

Answers (1)

Booboo

Reputation: 44213

I am not entirely clear what your issue is. I have some recommendations for improvement below, but on the page you link to you seem to say that everything works as expected, and I don't see anything very wrong with the code as long as you are running on Linux.

Since the subprocess.call method already creates a new process, you should just be using multithreading to invoke your worker function, work. Had you been using multiprocessing on a platform that uses the spawn method to create new processes (such as Windows), then creating the progress bar outside of the if __name__ == '__main__': block would have resulted in the creation of 4 additional progress bars that did nothing. Not good! So for portability it is best to move its creation inside the if __name__ == '__main__': block.

import subprocess
from multiprocessing.pool import ThreadPool
from tqdm import tqdm


def work(sec_sleep):
    command = ['python', 'worker.py', sec_sleep]
    subprocess.call(command)


def update_progress_bar(_):
    progress_bar.update()


if __name__ == '__main__':
    NUMBER_OF_TASKS = 4

    progress_bar = tqdm(total=NUMBER_OF_TASKS)

    pool = ThreadPool(NUMBER_OF_TASKS)

    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(work, (seconds,), callback=update_progress_bar)

    pool.close()
    pool.join()

Note: If your worker.py program prints to the console, it will mess up the progress bar (the progress bar will be re-written repeatedly on multiple lines).
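If that is an issue, a simple workaround (a sketch, not specific to your worker) is to capture the child's output rather than letting it write to the terminal, e.g. with subprocess.run; the captured text is then available on the returned CompletedProcess object:

def work(sec_sleep):
    command = ['python', 'worker.py', sec_sleep]
    # capture_output=True keeps the child's stdout/stderr off the terminal,
    # so it cannot interfere with the progress bar (requires Python 3.7+)
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.stdout  # or log it, parse it, etc.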

Have you considered importing worker.py (some refactoring of that code might be necessary) instead of invoking a new Python interpreter to execute it? In this case you would want to be explicitly using multiprocessing. On Windows this might not save you anything, since a new Python interpreter is executed for each new process anyway, but it could save you time on Linux:

from multiprocessing.pool import Pool
from worker import do_work
from tqdm import tqdm


def update_progress_bar(_):
    progress_bar.update()

if __name__ == '__main__':
    NUMBER_OF_TASKS = 4
    progress_bar = tqdm(total=NUMBER_OF_TASKS)
    pool = Pool(NUMBER_OF_TASKS)

    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(do_work, (seconds,), callback=update_progress_bar)

    pool.close()
    pool.join()
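
One last point: you said you also need to store what each run produces. apply_async returns an AsyncResult object whose get method gives you the worker's return value (and re-raises any exception raised in the worker). Assuming do_work returns the parsed data, you could replace the loop and the close/join calls above with something like:

    async_results = [
        pool.apply_async(do_work, (seconds,), callback=update_progress_bar)
        for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]
    ]
    pool.close()
    pool.join()
    # each get() returns what do_work returned for that task
    all_results = [result.get() for result in async_results]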

Upvotes: 1
