Reputation: 792
I discovered a strange error when using concurrent.futures
to read from multiple text files.
Here is a small reproducible example:
import os
import concurrent.futures
def read_file(file):
with open(os.path.join(data_dir, file),buffering=1000) as f:
for row in f:
try:
print(row)
except Exception as e:
print(str(e))
if __name__ == '__main__':
data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data'))
files = ['file1', 'file2']
with concurrent.futures.ProcessPoolExecutor() as executor:
for file,_ in zip(files,executor.map(read_file,files)):
pass
file1
and file2
are arbitrary text files in the data
directory.
I am getting the following error (basically a process tries to read data_dir
variable before it is assigned):
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 175, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in _process_chunk
return [fn(*args) for args in chunk]
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in <listcomp>
return [fn(*args) for args in chunk]
File "C:\Users\my_username\Downloads\example.py", line 5, in read_file
with open(os.path.join(data_dir, file),buffering=1000) as f:
NameError: name 'data_dir' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "example.py", line 16, in <module>
for file,_ in zip(files,executor.map(read_file,files)):
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 556, in result_iterator
yield future.result()
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 405, in result
return self.__get_result()
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 357, in __get_result
raise self._exception
NameError: name 'data_dir' is not defined
If I place data_dir
assignment before if __name__ == '__main__':
block, I don't get this error and the code executes as expected.
What is causing this error? Clearly, data_dir
is assigned before any asynchronous calls should be made in both cases.
Upvotes: 2
Views: 1640
Reputation: 309
fork()
not available on windows, so python use spawn
to start new process, which will start a fresh python interpreter process, no memory will be shared, but python will try to recreate worker function environment in the new process, that's why module level variable works. See doc for more detail.
Upvotes: 2
Reputation: 6272
ProcessPoolExecutor
spaws a new Python process, imports the right module and calls the function you provide.
As data_dir
will only be defined when you run the module, not when you import it, the error is to be expected.
Providing the data_dir
file descriptor as an argument to read_file
might work, as I believe that processes inherit the file descriptors of their parents. You'd need to check, though.
If were to use a ThreadPoolExecutor
however, your example should work, as the spawned threads share memory.
Upvotes: 3