Reputation: 37876
I'm just a noob in this context:
I am trying to run one function in multiple processes so I can process a huge file in less time.
I tried:
from multiprocessing import Process

for file_chunk in file_chunks:
    p = Process(target=my_func, args=(file_chunk, my_arg2))
    p.start()
    # without .join(), otherwise main proc has to wait
    # for proc1 to finish so it can start proc2
but it did not seem fast enough.
Now I ask myself whether it is really running the jobs in parallel. I thought about Pool as well, but I am using Python 2 and it is ugly to make it map two arguments to the function.
Am I missing something in my code above, or do the processes created this way really run in parallel?
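To be concrete about the "ugly" part: as far as I know, the Python 2 workaround for passing two arguments through Pool.map is to pack them into a tuple and unpack them in a top-level wrapper, roughly like this (just a sketch; my_func, file_chunks and my_arg2 are the same placeholders as in my snippet above):

from multiprocessing import Pool

def my_func_wrapper(packed):
    file_chunk, my_arg2 = packed          # Pool.map passes a single argument, so unpack the tuple here
    return my_func(file_chunk, my_arg2)   # call the real two-argument function

pool = Pool()
results = pool.map(my_func_wrapper, [(chunk, my_arg2) for chunk in file_chunks])
pool.close()
pool.join()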
Upvotes: 4
Views: 5177
Reputation: 15040
The speedup is proportional to the number of CPU cores your PC has, not the number of chunks.
Ideally, if you have 4 CPU cores, you should see a 4x speedup. However, other factors, such as IPC overhead, must be taken into account when considering the performance improvement.
Spawning too many processes will also negatively affect your performance as they will compete against each other for the CPU.
I'd recommend using a multiprocessing.Pool to deal with most of the logic. If you have multiple arguments, just use the apply_async method.
from multiprocessing import Pool

pool = Pool()  # defaults to one worker process per CPU core
for file_chunk in file_chunks:
    pool.apply_async(my_func, args=(file_chunk, arg1, arg2))
pool.close()   # no more tasks will be submitted
pool.join()    # wait for all submitted tasks to finish
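If my_func also returns something you need back in the parent process, every apply_async call gives you an AsyncResult that you can collect and .get() later. A small self-contained sketch (my_func, file_chunks, arg1 and arg2 below are just placeholders for your own objects):

from multiprocessing import Pool

def my_func(file_chunk, arg1, arg2):
    # placeholder for the real per-chunk work
    return len(file_chunk)

if __name__ == '__main__':
    file_chunks = ['chunk-a', 'chunk-b', 'chunk-c']   # placeholder data
    arg1, arg2 = 'x', 'y'                             # placeholder extra arguments

    pool = Pool()
    async_results = [pool.apply_async(my_func, args=(chunk, arg1, arg2))
                     for chunk in file_chunks]
    pool.close()                                      # no more tasks will be submitted
    pool.join()                                       # wait for the workers to finish
    print([r.get() for r in async_results])           # .get() also re-raises worker exceptions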
Upvotes: 9
Reputation: 2983
I am not an expert either, but what you should try is using joblib's Parallel.
from joblib import Parallel, delayed
import multiprocessing as mp

def random_function(args):
    pass  # replace with the real work for one item

proc = mp.cpu_count()  # one parallel job per CPU core
args_list = []         # your list of per-call arguments
Parallel(n_jobs=proc)(delayed(random_function)(args) for args in args_list)
This will run the given function (random_function) in parallel across the available CPUs (n_jobs).
Feel free to read the docs!
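Applied to the question it could look roughly like this (a sketch only; my_func, file_chunks and my_arg2 are placeholders for the asker's objects):

from joblib import Parallel, delayed
import multiprocessing as mp

def my_func(file_chunk, my_arg2):
    # placeholder for the real per-chunk processing
    return len(file_chunk)

file_chunks = ['chunk-a', 'chunk-b', 'chunk-c']  # placeholder chunks
my_arg2 = 'extra'                                # placeholder second argument

results = Parallel(n_jobs=mp.cpu_count())(
    delayed(my_func)(chunk, my_arg2) for chunk in file_chunks
)
print(results)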
Upvotes: 3