Max Mikhaylov
Max Mikhaylov

Reputation: 792

Python multiprocessing race condition

I discovered a strange error when using concurrent.futures to read from multiple text files.

Here is a small reproducible example:

import os
import concurrent.futures

def read_file(file):
    with open(os.path.join(data_dir, file),buffering=1000) as f:
        for row in f:
            try:
                print(row)
            except Exception as e:
                print(str(e))

if __name__ == '__main__':
    data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data'))
    files = ['file1', 'file2']
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for file,_ in zip(files,executor.map(read_file,files)):
            pass    

file1 and file2 are arbitrary text files in the data directory.

I am getting the following error (basically a process tries to read data_dir variable before it is assigned):

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in _process_chunk
    return [fn(*args) for args in chunk]
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in <listcomp>
    return [fn(*args) for args in chunk]
  File "C:\Users\my_username\Downloads\example.py", line 5, in read_file
    with open(os.path.join(data_dir, file),buffering=1000) as f:
NameError: name 'data_dir' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "example.py", line 16, in <module>
    for file,_ in zip(files,executor.map(read_file,files)):
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 556, in result_iterator
    yield future.result()
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 405, in result
    return self.__get_result()
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 357, in __get_result
    raise self._exception
NameError: name 'data_dir' is not defined

If I place data_dir assignment before if __name__ == '__main__': block, I don't get this error and the code executes as expected.

What is causing this error? Clearly, data_dir is assigned before any asynchronous calls should be made in both cases.

Upvotes: 2

Views: 1640

Answers (2)

J. Anderson
J. Anderson

Reputation: 309

fork() not available on windows, so python use spawn to start new process, which will start a fresh python interpreter process, no memory will be shared, but python will try to recreate worker function environment in the new process, that's why module level variable works. See doc for more detail.

Upvotes: 2

dorian
dorian

Reputation: 6272

ProcessPoolExecutor spaws a new Python process, imports the right module and calls the function you provide. As data_dir will only be defined when you run the module, not when you import it, the error is to be expected.

Providing the data_dir file descriptor as an argument to read_file might work, as I believe that processes inherit the file descriptors of their parents. You'd need to check, though.

If were to use a ThreadPoolExecutor however, your example should work, as the spawned threads share memory.

Upvotes: 3

Related Questions