Reputation: 1426
Is there some way to lock a zarr store when using append?
I have already found out the hard way that using append with multiple processes is a bad idea (the batches being appended aren't aligned with the chunk size of the store). The reason I'd like to use multiple processes is that I need to transform the original arrays before appending them to the zarr store. It would be nice to be able to block other processes from writing concurrently but still perform the transformations in parallel, then append their data in series.
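Roughly, the pattern I'm after looks like this (a minimal sketch, assuming zarr-python 2.x; transform(), the paths and the shapes are placeholders):

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

def transform(batch):
    # Placeholder for the real, expensive transformation.
    return batch * 2.0

def main():
    arr = zarr.open(
        "example.zarr", mode="w", shape=(0, 8), chunks=(1000, 8), dtype="f8"
    )
    batches = [np.random.rand(500, 8) for _ in range(4)]
    with ProcessPoolExecutor() as pool:
        # Transform in parallel ...
        for result in pool.map(transform, batches):
            # ... but append in series, from the parent process only.
            arr.append(result, axis=0)

if __name__ == "__main__":
    main()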
Edit:
Thanks to jdehesa's suggestion, I became aware of the synchronization part of the documentation. I passed a ProcessSynchronizer pointing to a folder on disk to my array at creation in the main process, then spawned a bunch of worker processes with concurrent.futures and passed the array to all the workers for them to append their results. I could see that the ProcessSynchronizer did something, as the folder I pointed it to filled with files, but the array my workers wrote to still ended up missing rows (compared to when it was written from a single process).
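The setup looked roughly like this (a sketch of the above, with the same placeholder transform() and shapes as before; the synchronizer path is arbitrary):

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

def transform(batch):
    # Placeholder for the real, expensive transformation.
    return batch * 2.0

def worker(arr, batch):
    # Every worker appends its own transformed batch.
    arr.append(transform(batch), axis=0)

def main():
    sync = zarr.ProcessSynchronizer("example.sync")
    arr = zarr.open(
        "example.zarr", mode="w", shape=(0, 8), chunks=(1000, 8),
        dtype="f8", synchronizer=sync,
    )
    batches = [np.random.rand(500, 8) for _ in range(4)]
    with ProcessPoolExecutor() as pool:
        for future in [pool.submit(worker, arr, b) for b in batches]:
            future.result()
    # example.sync fills with lock files, yet the freshly reopened array
    # has fewer than the expected 2000 rows.
    print(zarr.open("example.zarr", mode="r").shape)

if __name__ == "__main__":
    main()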
Upvotes: 1
Views: 509
Reputation: 108
I ran into the same problem.
Copying my response from https://github.com/zarr-developers/zarr-python/issues/2077:
The problem seems to be that the cached metadata is not updated after the shape is resized in another thread/process, leading to dropped rows.
I found two workarounds:
1. Wrap append so the metadata is reloaded before every write:
def fixed_append(arr, data, axis=0):
    def fixed_append_nosync(data, axis=0):
        # Re-read the array metadata before appending, so a resize done by
        # another process is picked up instead of the stale cached shape.
        arr._load_metadata_nosync()
        return arr._append_nosync(data, axis=axis)
    # _write_op takes the array's write lock (if a synchronizer is set)
    # before running the inner function.
    return arr._write_op(fixed_append_nosync, data, axis=axis)
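Each worker then calls the wrapper instead of arr.append (a sketch; arr, transform and batch follow the question's setup):

def worker(arr, batch):
    # Same worker as before, except the metadata is re-read under the
    # write lock right before each append.
    fixed_append(arr, transform(batch), axis=0)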
2. Open the array with cache_metadata=False to force the metadata to be reloaded on every data access.
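For example (a sketch, assuming zarr-python 2.x, where zarr.open forwards cache_metadata to open_array; paths are placeholders):

import zarr

# With cache_metadata=False the array re-reads its metadata on every
# access instead of trusting a cached copy.
arr = zarr.open(
    "example.zarr", mode="a",
    synchronizer=zarr.ProcessSynchronizer("example.sync"),
    cache_metadata=False,
)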
Upvotes: 1