Reputation: 39
I have the following data processing pipeline in Python 3.8:
My computer has 12 cores / 24 threads. Ideally, I'd want 24 instances of the program running concurrently, one on each thread, each one exporting data from 1 category sequentially, as fast as possible.
If I only need to export, for example, 3 categories, I'd want the program to run on 24 threads, and each instance could use up to 8 threads.
First, I made a script that contains the 3 classes, and a main that runs everything. If I run this by itself, it'll successfully export 1 category of data, although slowly. We'll call this script.py.
Then, I made a function (which we'll call parallelize()) that runs script.py using:
p = subprocess.Popen([mydir + "/script.py"] + [myargs], stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False, preexec_fn=os.setsid)
p.wait()
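Roughly, the whole wrapper boils down to something like this (the interpreter call and the script_path/my_args names are simplified placeholders, not my exact code):

import os
import subprocess
import sys

def parallelize(script_path, my_args):
    # launch one script.py instance in its own process group
    # (stdout/stderr are captured into pipes but never read before wait())
    p = subprocess.Popen(
        [sys.executable, script_path] + my_args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        shell=False,
        preexec_fn=os.setsid,
    )
    p.wait()
    return p.returncode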
Using those, I've tried the following methods all with the same mediocre results:
No matter how I try to do it, the result is always the same: when I start the program, all CPU cores and threads go to 100%, it starts exporting all categories at once, and it does so fast enough for my needs. Even if I'm only exporting 3 categories, it uses all 24 threads at 100%, indicating that it's making good use of the multithreading. BUT after just 5-10 minutes it suddenly slows down: 1 thread stays at 100%, the other 23 drop to about 10-20% usage, only 1 category keeps being processed, and if I look at the processes in the Ubuntu System Monitor I see all the Python instances sitting at 0% CPU, except for one running between 10 and 16%.
If I stop the export (it saves its progress up to that point) and resume it, the same thing happens. It would, paradoxically, be faster to run, stop and rerun the script every 5 minutes than to just let it run for days. What is stopping my CPU from running at 100% all the time instead of just for the first 5 minutes?
I'm not using any async, threading, multithreading or multiprocessing inside my 3 classes in the script for the time being, and the slowest part of said script is iterating the csv rows.
Upvotes: 0
Views: 685
Reputation: 1567
Functions like joblib.Parallel or multiprocessing.Pool.map offer an easy way to process a list of tasks on multiple cores/threads. They usually take only the script/computation and an iterable as arguments. However, both functions distribute the tasks as they like. Pool.map inspects the iterable and divides it across the number of cores/threads, but not necessarily into equal-sized chunks. So you could end up with core 1 having 100 tasks while the rest of your cores only have 10 tasks each. It depends on your iterable and on what both functions assume to be an adequate split.
The splitting of the iterable can even take more time than the actual computation of the tasks when the iterable is huge. Sometimes you run out of memory before any task has been started at all, just because of the splitting process.
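For what it's worth, you can pin the split down with Pool.map's chunksize argument; a small sketch (export_category and the category list are placeholders for your own work):

from multiprocessing import Pool

def export_category(category):
    ...  # run the per-category export here
    return category

if __name__ == "__main__":
    categories = ["cat_a", "cat_b", "cat_c"]  # placeholder category list
    with Pool(processes=3) as pool:
        # chunksize=1 hands the workers one category at a time instead of
        # letting Pool.map choose its own (possibly uneven) chunk size
        done = pool.map(export_category, categories, chunksize=1)
    print(done)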
Because of this, I migrated to always using Queues and doing the splitting manually. That way you have full control over the splitting and the memory consumption, and you can debug the whole process.
So, in your case it would look similar to this:
def script(in_queue, out_queue):
    for task in iter(in_queue.get, 'STOP'):
        # do stuff with your task
        out_queue.put(result)
And in your main thread:
import multiprocessing

if __name__ == "__main__":
    in_queue = multiprocessing.Queue()
    out_queue = multiprocessing.Queue()
    numProc = 4  # number of cores you like
    process = [multiprocessing.Process(target=script,
                                       args=(in_queue, out_queue))
               for x in range(numProc)]
    for p in process:
        p.start()
    for category in categories:
        in_queue.put(category)
    for p in process:
        in_queue.put('STOP')
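To round the sketch off, you would usually also read the results back and then join the workers. Draining the result queue first matters, because a process that has put items on a queue will not terminate until those items have been consumed:

    results = [out_queue.get() for _ in categories]  # one result per submitted task
    for p in process:
        p.join()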
With this scheme, all processes do exactly the same thing: take a task from the queue, do the computation and put the result into the other queue. If your cores all had exactly the same speed, the tasks would be done "chronologically", one core after the other, like:
task1 -> core1
task2 -> core2
task3 -> core1
task4 -> core2
A situation like yours, 100% on the first core and 10% on all others, would only arise at the very end, when nearly all tasks are done.
Upvotes: 2