sobek

Reputation: 1426

How can one write-lock a zarr store during append?

Is there some way to lock a zarr store when using append?

I have already found out the hard way that appending from multiple processes is a bad idea (the batches being appended aren't aligned with the store's chunk size). The reason I'd like to use multiple processes is that I need to transform the original arrays before appending them to the zarr store. It would be nice to block other processes from writing concurrently while still performing the transformations in parallel, then append their data in series, roughly as in the sketch below.
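To make the intent concrete, here is a rough sketch of the pattern I'm after (untested; transform_batch, the store path, shapes and dtype are placeholders for my real code, and it assumes zarr-python 2.x):

import concurrent.futures
import numpy as np
import zarr

def transform_batch(batch):
    # stand-in for the expensive per-batch transformation
    return batch * 2

def main():
    # appendable array, opened/created in the main process only
    arr = zarr.open('data.zarr', mode='a', shape=(0, 16),
                    chunks=(1024, 16), dtype='f8')
    batches = [np.random.rand(100, 16) for _ in range(8)]
    with concurrent.futures.ProcessPoolExecutor() as pool:
        # workers only transform; the main process is the sole writer
        for result in pool.map(transform_batch, batches):
            arr.append(result, axis=0)

if __name__ == '__main__':
    main()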

Edit:

Thanks to jdehesa's suggestion, I became aware of the synchronization section of the documentation. I passed a ProcessSynchronizer pointing to a folder on disk to my array at creation time in the main process, then spawned a bunch of worker processes with concurrent.futures and passed the array to all of them so they could append their results (roughly as sketched below). I could see that the ProcessSynchronizer did something, because the folder I pointed it to filled with files, but the array my workers wrote to still ended up missing rows (compared to when it is written from a single process).
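For reference, the synchronizer was wired up roughly like this (a sketch; paths, shapes and dtype are placeholders, and it assumes zarr-python 2.x):

import zarr

# the ProcessSynchronizer keeps its file locks in the given folder
synchronizer = zarr.ProcessSynchronizer('data.sync')

# array created once in the main process, then handed to the workers
arr = zarr.open('data.zarr', mode='a', shape=(0, 16), chunks=(1024, 16),
                dtype='f8', synchronizer=synchronizer)

# inside each worker process:
#     arr.append(transformed_batch, axis=0)
# the data.sync folder fills with files, yet rows still go missing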

Upvotes: 1

Views: 509

Answers (1)

Jason Adhinarta

Reputation: 108

I ran into the same problem.

Copying my response from https://github.com/zarr-developers/zarr-python/issues/2077

The problem seems to be that the cached metadata is not updated after the shape is resized in another thread/process, leading to dropped rows.

I found two workarounds:

  • defining a custom append function that forces the metadata to be reloaded (usage sketched after this list)

def fixed_append(arr, data, axis=0):
    # reload the array's metadata before appending so the current shape
    # (possibly resized by another process) is seen
    def fixed_append_nosync(data, axis=0):
        arr._load_metadata_nosync()
        return arr._append_nosync(data, axis=axis)
    return arr._write_op(fixed_append_nosync, data, axis=axis)
  • specifying cache_metadata=False when opening the array, to force a metadata reload on every data access
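A rough usage sketch of both workarounds (untested; the store path and data are placeholders, and fixed_append relies on private zarr v2 internals that may change between versions):

import numpy as np
import zarr

new_rows = np.random.rand(100, 16)  # stand-in for a worker's transformed batch

# Workaround 1: reload metadata explicitly before appending
arr = zarr.open('data.zarr', mode='a')
fixed_append(arr, new_rows, axis=0)

# Workaround 2: disable metadata caching so the shape is re-read on every access
arr = zarr.open('data.zarr', mode='a', cache_metadata=False)
arr.append(new_rows, axis=0)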

Upvotes: 1
