Reputation: 37876
I'm just a noob in this context:
I am trying to run one function in multiple processes so I can process a huge file in less time.
I tried:
from multiprocessing import Process

for file_chunk in file_chunks:
    p = Process(target=my_func, args=(file_chunk, my_arg2))
    p.start()
    # without .join(), otherwise main proc has to wait
    # for proc1 to finish so it can start proc2
but it did not seem fast enough.
Now I ask myself whether it is really running the jobs in parallel. I thought about Pool as well, but I am using Python 2 and it is ugly to make it map two arguments to the function.
Am I missing something in my code above, or do the processes created this way really run in parallel?
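To be concrete about the "ugly" part: as far as I know, the Python 2 workaround for passing two arguments through Pool.map is to pack them into a tuple and unpack them in a top-level wrapper, roughly like this (just a sketch; my_func, file_chunks and my_arg2 are the same placeholders as in my snippet above):

from multiprocessing import Pool

def my_func_wrapper(packed):
    file_chunk, my_arg2 = packed          # Pool.map passes a single argument, so unpack the tuple here
    return my_func(file_chunk, my_arg2)   # call the real two-argument function

pool = Pool()
results = pool.map(my_func_wrapper, [(chunk, my_arg2) for chunk in file_chunks])
pool.close()
pool.join()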
Upvotes: 4
Views: 5177
Reputation: 15040
The speedup is proportional to the number of CPU cores your PC has, not the number of chunks.
Ideally, if you have 4 CPU cores, you should see a 4x speedup. However, other factors, such as IPC overhead, must be taken into account when considering the performance improvement.
Spawning too many processes will also negatively affect your performance as they will compete against each other for the CPU.
I'd recommend using a multiprocessing.Pool to deal with most of the logic. If you have multiple arguments, just use the apply_async method.
from multiprocessing import Pool

pool = Pool()  # defaults to one worker process per CPU core
for file_chunk in file_chunks:
    pool.apply_async(my_func, args=(file_chunk, arg1, arg2))
pool.close()   # no more tasks will be submitted
pool.join()    # wait for all submitted tasks to finish
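If my_func also returns something you need back in the parent process, every apply_async call gives you an AsyncResult that you can collect and .get() later. A small self-contained sketch (my_func, file_chunks, arg1 and arg2 below are just placeholders for your own objects):

from multiprocessing import Pool

def my_func(file_chunk, arg1, arg2):
    # placeholder for the real per-chunk work
    return len(file_chunk)

if __name__ == '__main__':
    file_chunks = ['chunk-a', 'chunk-b', 'chunk-c']   # placeholder data
    arg1, arg2 = 'x', 'y'                             # placeholder extra arguments

    pool = Pool()
    async_results = [pool.apply_async(my_func, args=(chunk, arg1, arg2))
                     for chunk in file_chunks]
    pool.close()                                      # no more tasks will be submitted
    pool.join()                                       # wait for the workers to finish
    print([r.get() for r in async_results])           # .get() also re-raises worker exceptions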
Upvotes: 9
Reputation: 2983
I am not an expert either, but what you should try is using joblib's Parallel.
from joblib import Parallel, delayed
import multiprocessing as mp

def random_function(args):
    pass  # replace with the real work for one item

proc = mp.cpu_count()  # one parallel job per CPU core
args_list = []         # your list of per-call arguments
Parallel(n_jobs=proc)(delayed(random_function)(args) for args in args_list)
This will run the given function (random_function) in parallel across the available CPUs (n_jobs).
Feel free to read the docs!
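Applied to the question it could look roughly like this (a sketch only; my_func, file_chunks and my_arg2 are placeholders for the asker's objects):

from joblib import Parallel, delayed
import multiprocessing as mp

def my_func(file_chunk, my_arg2):
    # placeholder for the real per-chunk processing
    return len(file_chunk)

file_chunks = ['chunk-a', 'chunk-b', 'chunk-c']  # placeholder chunks
my_arg2 = 'extra'                                # placeholder second argument

results = Parallel(n_jobs=mp.cpu_count())(
    delayed(my_func)(chunk, my_arg2) for chunk in file_chunks
)
print(results)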
Upvotes: 3