s-m-e

Reputation: 3729

What's an intelligent way of loading a compressed array completely from disk into memory, also (identically) compressed?

I am experimenting with a 3-dimensional zarr-array, stored on disk:

Name: /data
Type: zarr.core.Array
Data type: int16
Shape: (102174, 1100, 900)
Chunk shape: (12, 220, 180)
Order: C
Read-only: True
Compressor: Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Store type: zarr.storage.DirectoryStore
No. bytes: 202304520000 (188.4G)
No. bytes stored: 12224487305 (11.4G)
Storage ratio: 16.5
Chunks initialized: 212875/212875

As I understand it, zarr arrays can also reside in memory, compressed just as they are on disk. So I thought: why not try to load the entire thing into RAM on a machine with 32 GByte of memory? Compressed, the dataset would require roughly a third of the available RAM. Uncompressed, it would require about 6 times more RAM than is available.

Preparation:

import os
import zarr
from numcodecs import Blosc
import tqdm
zpath = '...' # path to zarr data folder

disk_array = zarr.open(zpath, mode='r')['data']

c = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
memory_array = zarr.zeros(
    disk_array.shape, chunks=disk_array.chunks,
    dtype=disk_array.dtype, compressor=c
)

The following experiment fails almost immediately with an out-of-memory error:

memory_array[:, :, :] = disk_array[:, :, :]

As I understand it, disk_array[:, :, :] will try to create an uncompressed, full-size numpy array, which will obviously fail.

Second attempt, which works but is agonizingly slow:

chunk_lines = disk_array.chunks[0]
chunk_number = -(-disk_array.shape[0] // chunk_lines)  # ceiling division, so the trailing partial chunk is included
for chunk in tqdm.trange(chunk_number):
    chunk_slice = slice(chunk * chunk_lines, (chunk + 1) * chunk_lines)
    memory_array[chunk_slice, :, :] = disk_array[chunk_slice, :, :]

Here, I am reading a certain number of chunks at a time and putting them into my in-memory array. It works, but it is about 6 to 7 times slower than writing this thing to disk took in the first place. EDIT: Yes, it is still slow, but the factor of 6 to 7 turned out to be due to a disk issue.

What's an intelligent and fast way of achieving this? Besides possibly not using the right approach, I'd guess my chunks might also be too small, but I am not sure.

EDIT: Shape, chunk size and compression are supposed to be identical for the on-disk array and the in-memory array. It should therefore be possible to eliminate the decompress-compress procedure in my example above.

I found zarr.convenience.copy but it is marked as an experimental feature, subject to further change.
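
For reference, this is roughly what I imagine using that experimental API would look like (a sketch, not verified; it assumes an in-memory destination group backed by zarr.MemoryStore):

# a rough sketch, not verified: copy the on-disk array into a group
# backed by an in-memory store via the experimental convenience API
dest_group = zarr.group(store=zarr.MemoryStore())
zarr.convenience.copy(disk_array, dest_group, name='data')
memory_array = dest_group['data']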


Related issue on GitHub

Upvotes: 0

Views: 702

Answers (2)

jakirkham

Reputation: 715

There are a couple of ways one might solve this issue today.

  1. Use LRUStoreCache to cache (some) compressed data in memory.
  2. Coerce your underlying store into a dict and use that as your store.

The first option may be appropriate if you only want some frequently used data in memory. How much you load into memory is something you can configure, so it could be the whole array. Data is only cached on demand as it is accessed, which may be useful for you.
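
A sketch of what the first option might look like (assuming zarr.LRUStoreCache and the zpath from the question; the max_size here is just an illustrative value large enough to hold the compressed array):

# a sketch of option 1: wrap the on-disk store in an LRU cache sized
# (in bytes of compressed data) to hold the whole array
import zarr

store = zarr.DirectoryStore(zpath)                     # zpath from the question
cache = zarr.LRUStoreCache(store, max_size=12 * 2**30)
cached_array = zarr.open(cache, mode='r')['data']

cached_array[:12]   # first access reads chunks from disk and fills the cache
cached_array[:12]   # the same chunks are now served from memory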

The second option just creates a new in-memory copy of the array by pulling all of the compressed data from disk. The one downside is that if you intend to write back to disk, you will need to do so manually, but that is not too difficult. The update method is pretty handy for facilitating this copying of data between different stores.
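
A sketch of the second option (assuming the zpath from the question; in zarr v2 a plain dict can serve as a store):

# a sketch of option 2: pull every compressed chunk from disk into a dict
# and open the array from that in-memory mapping
import zarr

disk_store = zarr.DirectoryStore(zpath)   # zpath from the question
mem_store = {}
mem_store.update(disk_store)              # copies only the compressed bytes
memory_array = zarr.open(mem_store, mode='r')['data']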

Upvotes: 1

mdurant

Reputation: 28684

You could conceivably try fsspec.implementations.memory.MemoryFileSystem, which has a .get_mapper() method with which you can make the kind of mapping object zarr expects.

However, this is really just a dict of path:io.BytesIO, which you could make yourself, if you want.
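
A sketch of what that might look like (assuming fsspec's in-memory filesystem, its get_mapper() method, and the zpath from the question; the 'zarr-data' root is arbitrary):

# a sketch: build a mapping backed by fsspec's in-memory filesystem,
# copy the compressed bytes into it, then open the zarr array from it
import fsspec
import zarr

mem_fs = fsspec.filesystem('memory')
mapper = mem_fs.get_mapper('zarr-data')    # arbitrary root inside the memory FS

mapper.update(zarr.DirectoryStore(zpath))  # zpath from the question
memory_array = zarr.open(mapper, mode='r')['data']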

Upvotes: 1
