doniyor

Reputation: 37876

python multiprocessing Pool vs Process?

I'm just a noob in this context:

I am trying to run one function in multiple processes so I can process a huge file in a shorter time.

I tried

from multiprocessing import Process

for file_chunk in file_chunks:
    p = Process(target=my_func, args=(file_chunk, my_arg2))
    p.start()
    # without .join(), otherwise the main process has to wait
    # for proc1 to finish before it can start proc2
but it did not seem to be fast enough.

Now I ask myself whether the jobs are really running in parallel. I thought about Pool too, but I am using Python 2 and it is ugly to make it map two arguments to the function.

Am I missing something in my code above, or do the processes created this way really run in parallel?

Upvotes: 4

Views: 5177

Answers (2)

noxdafox

Reputation: 15040

The speedup is proportional to the amount of CPU cores your PC has, not the amount of chunks.

Ideally, if you have 4 CPU cores, you should see a 4x speedup. Yet other factors such as IPC overhead must be taken into account when considering the performance improvement.

Spawning too many processes will also negatively affect your performance as they will compete against each other for the CPU.

I'd recommend using a multiprocessing.Pool to deal with most of the logic. If you have multiple arguments, just use the apply_async method.

from multiprocessing import Pool

pool = Pool()  # defaults to one worker process per CPU core

for file_chunk in file_chunks:
    pool.apply_async(my_func, args=(file_chunk, arg1, arg2))

pool.close()  # no more tasks will be submitted
pool.join()   # wait for all submitted tasks to finish
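
If my_func returns something you need, apply_async hands back an AsyncResult whose get() blocks until the task has finished. A minimal sketch of the full pattern, assuming my_func returns one value per chunk:

from multiprocessing import Pool

pool = Pool()

# submit one task per chunk and keep the AsyncResult handles
async_results = [pool.apply_async(my_func, args=(chunk, arg1, arg2))
                 for chunk in file_chunks]

pool.close()
pool.join()

# collect the return values, one per chunk, in submission order
results = [result.get() for result in async_results]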

Upvotes: 9

Fourier

Reputation: 2983

I am not an expert either, but what you should try is using joblib Parallel

from joblib import Parallel, delayed
import multiprocessing as mp

def random_function(args):
    pass

# use one job per available CPU core
proc = mp.cpu_count()

# delayed() wraps the call so Parallel can dispatch it to the worker processes
Parallel(n_jobs=proc)(delayed(random_function)(args) for args in args_list)

This will run the given function (random_function) in parallel, using as many workers as there are available CPUs (n_jobs).
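
Applied to your case, it could look roughly like this (just a sketch, assuming my_func takes a file chunk plus your second argument):

from joblib import Parallel, delayed
import multiprocessing as mp

# one worker per CPU core; Parallel returns the results as a list
results = Parallel(n_jobs=mp.cpu_count())(
    delayed(my_func)(file_chunk, my_arg2) for file_chunk in file_chunks
)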

Feel free to read the docs!

Upvotes: 3
