Reputation: 213
I am looking to use the multiprocessing module to speed up the run time of some Transport Planning models. I've optimized as much as I can via 'normal' methods but at the heart of it is an absurdly parallel problem, e.g. performing the same set of matrix operations for 4 different sets of inputs, all independent information.
Pseudo Code:
for mat1, mat2, mat3, mat4 in zip([a1,a2,a3,a4], [b1,b2,b3,b4], [c1,c2,c3,c4], [d1,d2,d3,d4]):
    result1 = mat1 * mat2 ^ mat3
    result2 = mat1 / mat4
    result3 = mat3.T * mat2.T + mat4
So all I really want to do is process the iterations of this loop in parallel on a quad core computer. I've read up here and other places on the multiprocessing module and it seems to fit the bill perfectly except for the required:
if __name__ == '__main__'
From what I understand, this means that you can only multiprocess code run from a script? i.e. if I do something like:
import multiprocessing
from numpy.random import randn

a = randn(100,100)
b = randn(100,100)
c = randn(100,100)
d = randn(100,100)

def process_matrix(mat):
    return mat**2

if __name__ == '__main__':
    print("Multiprocessing")
    jobs = []
    for input_matrix in [a,b,c,d]:
        p = multiprocessing.Process(target=process_matrix, args=(input_matrix,))
        jobs.append(p)
        p.start()
It runs fine. However, assuming I saved the above as 'matrix_multiproc.py' and defined a new file 'importing_test.py' which just states:
import matrix_multiproc
The multiprocessing does not happen because __name__ is now 'matrix_multiproc' and not '__main__'.
Does this mean I can never use parallel processing on an imported module? All I am trying to do is have my model run as:
def Model_Run():
    import Part1, Part2, Part3, matrix_multiproc, Part4
    Part1.Run()
    Part2.Run()
    Part3.Run()
    matrix_multiproc.Run()
    Part4.Run()
Sorry for a really long question to what is probably a simple answer, thanks!
Upvotes: 11
Views: 3875
Reputation: 354
"You can use multiprocessing anywhere in your code, provided that the program's main module uses the if __name__ == '__main__' guard."
That's not the best way. The __main__ test is overly restrictive: it doesn't work in module code, and it limits what you can do. It's also not necessary. Any test that executes the initializing code exactly once across processes will do the job.
A better solution is to test for an environment variable. If it's unset, then set it and initialize your multiprocessing. Spawned processes will inherit the environ changes, see that the variable is already set, and not create more processes.
A more complex way is to inspect all parent frames for the name __main__. This way is harder and more error-prone: KISS.
The rest of this post explains why these approaches are better.
For instance, say you want to start a listener process in an imported module. The listener will monitor a Queue for data and handle it (in this example, log it). Notionally we have this:
# ----- in module.py
import multiprocessing as mp

def listener(queue):
    while True:
        ...  # do something with the queue

def init():
    'set up the shared queue and start the listener process'
    global queue
    queue = mp.Queue()
    child = mp.Process(
        target=listener,
        args=(queue,),
        daemon=True,  # so the parent process doesn't hang waiting for the child
    )
    child.start()

# start the listener
init()

def log(msg):
    queue.put(msg)

# ----- in main.py
import module

module.log('foo')
The above code is fine with fork, but breaks with spawn: init() will run again in each child, spawning another child, and another, ...
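To see that concretely, here is a minimal sketch that forces the spawn start method (set_start_method is stdlib; forcing it is only needed on Linux, since spawn is already the default on Windows and macOS):
import multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('spawn')  # force spawn even where fork is the default
    import module                 # module-level init() runs here ...
    # ... and again inside each spawned child, because spawn re-imports
    # the modules the child's target needs, triggering init() recursively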
How can you fix it? You can't test for __name__ == '__main__' when module.py is imported: __name__ will be 'module', so that test never fires.
Here's the obvious (and bad) solution: you could move the init() call to main.py and wrap it in __name__ == '__main__'. This is bad. Now anyone importing module.py has to call module.init() before they can use it. Imagine if the stdlib worked like this: you might have to call 15 or 20 init functions before you could start doing anything, and it's easy to miss one. Too error-prone.
The obvious solution also breaks when another module includes your module. Consider this scenario:
# --- mod1.py
def init(): ...      # start a child listener process
def log(msg): ...

# --- mod2.py
import mod1

def somefunc(*args):
    result = ...     # do something
    mod1.log(f'result is: {result}')

if __name__ == '__main__':
    mod1.init()

# --- main.py
import mod2

mod2.somefunc()
Now main.py breaks: mod1.init() was never called, and main.py doesn't even import mod1. The init problem is worse than before. Testing for __main__ is a bad solution.
A better solution is to test for an environment variable. If it's not set, then this is the first pass: set the variable and init your processes. Spawned children will inherit the environ, see that it's already set, and not init again. This keeps all the init behavior where it belongs, inside the module.
The challenge with this method is picking a unique env var name that isn't used by other programs on the system or by other Python modules. I recommend deriving it from __file__, because that doesn't change, unlike __name__. To be really safe, use the full path.
In the first example above, replace the call to init() with the following:
# --- module.py
import os

envkey = 'PYSPAWN_' + os.path.basename(__file__)

# start the listener on the first pass (parent process) only
if not os.environ.get(envkey):
    os.environ[envkey] = str(os.getpid())
    init()
If you just use __name__ instead of __file__, it will still work when module.py is imported. However, when module.py is run as the main script, __name__ will be '__main__' on the first run, then '__mp_main__' or similar in spawned children (depending on the multiprocessing library you use). envkey will take two different values and you'll get an extra child process.
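For instance, with the stdlib's spawn bootstrap:
# computed in the parent, where the script runs as '__main__':
envkey = 'PYSPAWN_' + __name__   # -> 'PYSPAWN___main__'

# computed in a spawned child, where the main module is re-imported
# as '__mp_main__', the same line gives -> 'PYSPAWN___mp_main__'.
# The child's key doesn't match the parent's entry in os.environ,
# so the check fails and init() runs a second time.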
You could also test for the name __main__ by looking at all parent frames, as described here. You have to look through the entire stack though, as the caller may not be the main module, as in the second example above. This is not as simple or efficient as using environ; the simpler solution is better.
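A minimal sketch of that frame-walking test (started_from_main is a hypothetical helper name; it uses the stdlib inspect module):
import inspect

def started_from_main():
    # walk every frame in the call stack; only in the original process
    # does some frame belong to a module named '__main__' (a spawned
    # child's re-imported main module is named '__mp_main__' instead)
    return any(
        frameinfo.frame.f_globals.get('__name__') == '__main__'
        for frameinfo in inspect.stack()
    )

if started_from_main():
    init()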
Update 5/2024 - Another option is to distinguish processes using an OS-provided multiprocess locking mechanism. For instance, your module can try to acquire a file lock in exclusive mode: the first process will succeed, and the others will immediately fail if the acquire is non-blocking. It's trickier than environ variables, with different potential failure points; the first process has to stay running and hold the lock for the duration of your program. But it's possible.
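A minimal sketch of the file-lock approach, assuming a Unix system (fcntl is Unix-only) and a hypothetical lock-file name:
import fcntl
import os

# take an exclusive, non-blocking lock on a file next to this module;
# the handle must stay open for the life of the program, or the lock
# is released and the scheme breaks
_lockfile = open(os.path.abspath(__file__) + '.lock', 'w')
try:
    fcntl.flock(_lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    pass      # another process already holds the lock: we're a child
else:
    init()    # first process in: start the listener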
Upvotes: 2
Reputation: 67157
Does this mean I can never use parallel processing on an imported module?
No, it doesn't. You can use multiprocessing
anywhere in your code, provided that the program's main module uses the if __name__ == '__main__'
guard.
On Unix systems you won't even need that guard, since they provide the fork() system call, which multiprocessing uses to create child processes from the main python process.
On Windows, on the other hand, fork() is emulated by multiprocessing by spawning a new process that runs the main module again, using a different __name__. Without the guard, your main application would try to spawn new processes again, resulting in an endless loop that eats up all your computer's memory pretty fast.
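In your case, that means matrix_multiproc.py can keep its worker code at module level and expose a Run() function, with the guard living only in the script that's actually executed. A minimal sketch, where run_model.py and the list of random matrices are assumptions for illustration:
# ----- matrix_multiproc.py (no top-level spawning; safe to import)
import multiprocessing

def process_matrix(mat):
    return mat**2

def Run(matrices):
    jobs = []
    for input_matrix in matrices:
        p = multiprocessing.Process(target=process_matrix, args=(input_matrix,))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()   # wait for every worker to finish

# ----- run_model.py (the program's entry point)
from numpy.random import randn
import matrix_multiproc

if __name__ == '__main__':   # the guard lives only in the main module
    matrix_multiproc.Run([randn(100,100) for _ in range(4)])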
Upvotes: 12