Reputation: 3729
I am experimenting with a 3-dimensional zarr-array, stored on disk:
Name: /data
Type: zarr.core.Array
Data type: int16
Shape: (102174, 1100, 900)
Chunk shape: (12, 220, 180)
Order: C
Read-only: True
Compressor: Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Store type: zarr.storage.DirectoryStore
No. bytes: 202304520000 (188.4G)
No. bytes stored: 12224487305 (11.4G)
Storage ratio: 16.5
Chunks initialized: 212875/212875
As I understand it, zarr arrays can also reside in memory, compressed just as they are on disk. So I thought: why not try to load the entire thing into RAM on a machine with 32 GB of memory? Compressed, the dataset would need roughly half of the available RAM (11.4 G). Uncompressed, it would need about six times more RAM than is available (188.4 G).
Preparation:
import os
import zarr
from numcodecs import Blosc
import tqdm
zpath = '...' # path to zarr data folder
disk_array = zarr.open(zpath, mode = 'r')['data']
c = Blosc(cname = 'zstd', clevel=3, shuffle = Blosc.BITSHUFFLE)
memory_array = zarr.zeros(
    disk_array.shape, chunks = disk_array.chunks,
    dtype = disk_array.dtype, compressor = c
)
The following experiment fails almost immediately with an out of memory error:
memory_array[:, :, :] = disk_array[:, :, :]
As I understand it, disk_array[:, :, :] will try to create an uncompressed, full-size numpy array, which will obviously fail.
Second attempt, which works but is agonizingly slow:
chunk_lines = disk_array.chunks[0]
chunk_number = disk_array.shape[0] // disk_array.chunks[0]
chunk_remain = disk_array.shape[0] % disk_array.chunks[0] # unhandled ...
for chunk in tqdm.trange(chunk_number):
    chunk_slice = slice(chunk * chunk_lines, (chunk + 1) * chunk_lines)
    memory_array[chunk_slice, :, :] = disk_array[chunk_slice, :, :]
Here, I am trying to read a certain number of chunks at a time and put them into my in-memory array. It works, but it is about 6 to 7 times slower than writing this thing to disk took in the first place. EDIT: Yes, it's still slow, but the factor of 6 to 7 turned out to be due to a disk issue.
What's an intelligent and fast way of achieving this? I'd guess that, besides possibly not using the right approach, my chunks might also be too small, but I am not sure.
EDIT: Shape, chunk size and compression are supposed to be identical for the on-disk array and the in-memory array. It should therefore be possible to eliminate the decompress-compress procedure in my example above.
I found zarr.convenience.copy, but it is marked as an experimental feature, subject to further change.
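For illustration, a minimal sketch of how that experimental call could be invoked here, assuming zarr v2's convenience API; note that copy() re-creates the destination array (as far as I understand, reusing the source's chunking and compression by default) and copies the values chunk by chunk, so it still goes through a decompress-compress cycle:

# Hypothetical illustration only (experimental API, subject to change).
mem_root = zarr.group()                                  # group backed by an in-memory store
zarr.convenience.copy(disk_array, mem_root, name='data') # chunk-wise decompress + recompress
memory_array = mem_root['data']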
Upvotes: 0
Views: 702
Reputation: 715
There are a couple of ways one might solve this issue today.
1. Use an LRUStoreCache to cache (some) compressed data in memory.
2. Copy the compressed data into a dict and use that as your store.

The first option might be appropriate if you only want some frequently used data in-memory. Of course, how much you load into memory is something you can configure, so this could be the whole array. Loading will only happen on-demand as data is accessed, which may be useful for you.
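A minimal sketch of the first option, assuming zarr v2, where LRUStoreCache and DirectoryStore live at the top level of the zarr package; the max_size value below is only an illustrative cap on how many compressed bytes to keep cached:

import zarr

# Wrap the on-disk store in an LRU cache so compressed chunks stay in RAM after
# their first access. max_size counts compressed bytes (illustrative value).
store = zarr.DirectoryStore(zpath)                      # zpath: path to the zarr folder
cache = zarr.LRUStoreCache(store, max_size=16 * 2**30)  # keep up to ~16 GiB cached
cached_array = zarr.open(cache, mode='r')['data']       # reads now populate the cache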
The second option just creates a new in-memory copy of the array by pulling all of the compressed data from disk. The one downside is that, if you intend to write back to disk, this will be something you need to do manually, but it is not too difficult. The update method is pretty handy for facilitating this copying of data between different stores.
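A minimal sketch of the second option, again assuming zarr v2, where DirectoryStore behaves like a mapping, so dict.update() can pull all of the compressed chunk bytes across in one go:

import zarr

# Copy the raw, still-compressed chunk bytes from the directory store into a dict.
# No decompression or recompression happens here; it is a byte-for-byte store copy.
disk_store = zarr.DirectoryStore(zpath)   # zpath: path to the zarr folder
mem_store = {}
mem_store.update(disk_store)              # dict.update() iterates the source mapping
memory_array = zarr.open(mem_store, mode='r')['data']

Writing back to disk would then be the mirror operation, e.g. updating the on-disk store from the dict.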
Upvotes: 1
Reputation: 28684
You could conceivably try with fsspec.implementations.memory.MemoryFileSystem, which has a .make_mapper() method with which you can make the kind of object expected by zarr. However, this is really just a dict of path:io.BytesIO, which you could make yourself, if you want.
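A minimal sketch along those lines, assuming a reasonably current fsspec where the mapper is obtained via fsspec.get_mapper (the exact mapper-creation call may differ between fsspec versions):

import fsspec
import zarr

# Get a dict-like mapping backed by fsspec's in-memory filesystem, fill it with the
# compressed chunk bytes from the on-disk store, and open it with zarr.
mem_map = fsspec.get_mapper('memory://zarr-data')   # 'zarr-data' is an arbitrary prefix
mem_map.update(zarr.DirectoryStore(zpath))          # byte-for-byte copy, no recompression
memory_array = zarr.open(mem_map, mode='r')['data']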
Upvotes: 1