Reputation: 434
Say I have scrapper_1.py, scrapper_2.py, scrapper_3.py.
The way I run them now is from PyCharm, running/executing each one separately; this way I can see the 3 python.exe processes running in Task Manager.
Now I'm trying to write a master script, say scrapper_runner.py, that imports these scrapers as modules and runs them all in parallel, not sequentially.
I tried examples with subprocess, multiprocessing, even os.system from various SO posts ... but without any luck ... from the logs they all run in sequence, and in Task Manager I only see one python.exe executing.
Is this the right pattern for this kind of process?
EDIT 1: (trying with concurrent.futures ProcessPoolExecutor) it runs sequentially.
from concurrent.futures import ProcessPoolExecutor
import scrapers.scraper_1 as scraper_1
import scrapers.scraper_2 as scraper_2
import scrapers.scraper_3 as scraper_3
## Calling method runner on each scrapper_x to kick off processes
runners_list = [scraper_1.runner(), scraper_1.runner(), scraper_3.runner()]
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=10) as executor:
        for runner in runners_list:
            future = executor.submit(runner)
            print(future.result())
Upvotes: 0
Views: 226
Reputation: 313
A subprocess in Python may or may not show up as a separate process, depending on your OS and your task manager. htop on Linux, for example, will display subprocesses under the parent process in tree view.
I recommend taking a look at this in-depth tutorial on the multiprocessing module in Python: https://pymotw.com/2/multiprocessing/basics.html
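As a rough illustration of that approach, a minimal sketch with multiprocessing.Process could look like the following (assuming, as in the question, that each scraper module exposes a runner() function):
from multiprocessing import Process

import scrapers.scraper_1 as scraper_1
import scrapers.scraper_2 as scraper_2
import scrapers.scraper_3 as scraper_3

if __name__ == "__main__":
    # One process per scraper; pass the function itself, not the result of calling it
    processes = [Process(target=m.runner) for m in (scraper_1, scraper_2, scraper_3)]
    for p in processes:
        p.start()   # all three start right away, each as its own process
    for p in processes:
        p.join()    # wait for all of them to finish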
However, if Python's built-in methods for multiprocessing/threading don't work or don't make sense to you, you can achieve your desired result by using bash to call your Python scripts. The following bash script produces the result shown in the attached screenshot.
#!/bin/sh
./py1.py &
./py2.py &
./py3.py &
Explanation: the & at the end of each call tells bash to run each call as a background process.
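The same idea works from Python itself: launching each script with subprocess.Popen starts a separate interpreter per script without blocking, so all of them run at once (the script names below are taken from the question and are assumed to sit next to the launcher):
import subprocess
import sys

# Popen returns immediately, so all three scripts run at the same time,
# each in its own python.exe
procs = [
    subprocess.Popen([sys.executable, script])
    for script in ("scrapper_1.py", "scrapper_2.py", "scrapper_3.py")
]

for p in procs:
    p.wait()  # block until every scraper has finished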
Upvotes: 2
Reputation: 6891
Your problem is in how you set up the processes. You are not running them in parallel, even though you think you are. You actually run each scraper at the moment you add it to runners_list, and then you submit the result of each runner (not the runner itself) to the multiprocessing pool.
What you want to do is add the functions to runners_list without executing them, and then have them executed in your multiprocessing pool. The way to achieve this is to add the function references, i.e. the names of the functions. To do this, you should not include the parentheses, since parentheses are the syntax for calling a function rather than just naming it.
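As a tiny self-contained illustration of that difference (the work function here is just a stand-in for your runner):
def work():
    return "done"

result = work()   # the parentheses call the function now; result holds "done"
ref = work        # no parentheses: ref is just another name for the function
print(ref())      # calling through the reference runs it here and prints "done"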
In addition, to have the futures execute asynchronously, you cannot call future.result() directly after each submit, as that forces the code to execute sequentially, to ensure that the results become available in the same sequence as the functions were called.
This means that the solution to your problem is:
from concurrent.futures import ProcessPoolExecutor
import scrapers.scraper_1 as scraper_1
import scrapers.scraper_2 as scraper_2
import scrapers.scraper_3 as scraper_3
## NOT calling method runner on each scraper_x to kick off processes.
## Instead, add the function references to the list of callables to be run in the pool.
runners_list = [scraper_1.runner, scraper_2.runner, scraper_3.runner]

# Callback function to call when a future is done.
# If the result is not printed in the callback, the future.result() call
# will serialize the call sequence to ensure results arrive in order.
def print_result(future):
    print(future.result())

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=10) as executor:
        for runner in runners_list:
            future = executor.submit(runner)
            future.add_done_callback(print_result)
As you can see, here the invocation of the runners does not happen when the list is created, but later, when each runner is submitted to the executor. And when a result is ready, the callback is called to print it to the screen.
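If callbacks feel awkward, an equivalent way to collect the results without serializing the submissions is concurrent.futures.as_completed; here is a sketch under the same assumptions about the scraper modules:
from concurrent.futures import ProcessPoolExecutor, as_completed

import scrapers.scraper_1 as scraper_1
import scrapers.scraper_2 as scraper_2
import scrapers.scraper_3 as scraper_3

runners_list = [scraper_1.runner, scraper_2.runner, scraper_3.runner]

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=10) as executor:
        # Submit everything first so all runners start in parallel
        futures = [executor.submit(runner) for runner in runners_list]
        # as_completed yields each future as soon as it finishes,
        # so results print in completion order rather than submission order
        for future in as_completed(futures):
            print(future.result())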
Upvotes: 0