How to access an entire zarr array without loading the entire array into memory?

Question

I am using zarr to store a large, chunked array. Let's say the store and zarr array are created like so:

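import zarr
import numpy as np

# On-disk store backing the array
store = zarr.storage.LocalStore('mydata.zarr')
zarr_data = zarr.create_array(
    store=store,
    name='data',
    shape=(40000, 100),
    chunks=(1000, 100),  # 40 chunks of 1000 rows each
    dtype='f4')

# Fill the whole array; the data shape must match the array shape
zarr_data[:] = np.random.randn(40000, 100)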

I have a separate Python program that will access the entire array, but not all at once; rather, it will retrieve data in batches:

import zarr

store = zarr.storage.LocalStore('mydata.zarr')
root = zarr.open(store=store, mode='r')

# Read one row at a time
for i in range(root['data'].shape[0]):
    batch = root['data'][i]

My question is: if I were to run this code, would the entire dataset eventually end up loaded into memory? If so, is there a way to avoid this? In my application, the full zarr array will be too large to fit in memory. I think the ideal solution would be a way to release previously read chunks from memory once I no longer need them. Is there a way to do this?
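For reference, here is a minimal sketch of the access pattern I have in mind: reading the array one chunk at a time, so that each slice covers exactly one stored chunk along the first axis. The helper name `chunk_aligned_slices` is hypothetical (it is not part of zarr), and the sketch assumes that slicing a zarr array returns a new NumPy array holding only the requested rows.

```python
from typing import Iterator


def chunk_aligned_slices(n_rows: int, chunk_rows: int) -> Iterator[slice]:
    """Yield row slices that each cover one chunk along axis 0."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, n_rows, chunk_rows):
        # Clip the last slice so it never runs past the end of the array.
        yield slice(start, min(start + chunk_rows, n_rows))


# Hypothetical usage with the array from the question (requires zarr):
#   data = root['data']
#   for rows in chunk_aligned_slices(data.shape[0], data.chunks[0]):
#       batch = data[rows]  # one chunk's worth of rows as a NumPy array
#       ...                 # process the batch here
```

Because `batch` is rebound on every iteration, the previous array becomes unreferenced and can be reclaimed by Python, provided nothing else (including the library, which is the open question here) keeps a reference to it.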

Experiments

I ran some code similar to the above scenario and measured the memory usage. The zarr array contains 16 MB of data, but the process memory usage increases by only about 2 MB over the course of iterating over the array. Here is the code and a visualization of the memory usage:

Code I ran:

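import os

import psutil
import zarr

def current_memory():
    """Return the resident set size (RSS) of this process in MiB."""
    process = psutil.Process(os.getpid())
    memory_usage = process.memory_info().rss
    memory_usage_mb = memory_usage / (1024 ** 2)
    return memory_usage_mb

# Baseline taken before the store is opened
init_mem = current_memory()
mem_measurements = []

store = zarr.storage.LocalStore('./memtest.zarr')
root = zarr.open(store=store, mode='r')

print(root['data'].info)

batch_size=10
# Note: this stop value means the final batch of rows is not read
for i in range(0, root['data'].shape[0] - batch_size, batch_size):
    batch = root['data'][i:i+batch_size]
    # Record RSS growth relative to the baseline after each batch
    mem_measurements.append(current_memory() - init_mem)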

Output of print(root['data'].info):

Type               : Array
Zarr format        : 3
Data type          : DataType.float32
Shape              : (40000, 100)
Chunk shape        : (1000, 100)
Order              : C
Read-only          : False
Store type         : LocalStore
Filters            : ()
Serializer         : BytesCodec(endian=)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 16000000 (15.3M)
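As a sanity check on these numbers, the sizes can be worked out by hand from the shape, chunk shape, and dtype reported above. This is only arithmetic on the uncompressed sizes; it says nothing about how many decoded chunks zarr actually holds at any moment.

```python
import math

shape = (40000, 100)
chunks = (1000, 100)
itemsize = 4  # bytes per float32 element

# Uncompressed size of the whole array and of a single chunk
total_bytes = shape[0] * shape[1] * itemsize
chunk_bytes = chunks[0] * chunks[1] * itemsize

# Number of chunks along each axis, rounded up
n_chunks = math.ceil(shape[0] / chunks[0]) * math.ceil(shape[1] / chunks[1])

print(total_bytes)                       # 16000000
print(round(total_bytes / 1024**2, 1))   # 15.3
print(chunk_bytes)                       # 400000
print(round(chunk_bytes / 1024**2, 2))   # 0.38
print(n_chunks)                          # 40
```

A single decoded chunk is about 0.38 MiB, so a memory increase of roughly 2 MB is at least consistent with only a handful of chunk-sized buffers being alive at once, rather than all 40 chunks.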

Measured memory usage: [figure: memory usage over batches]

Closing thoughts

It seems that zarr doesn't load the entire array into memory when iterating over it. The next question is: why is that? Does zarr cache anything or keep chunks in memory?
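One way to probe this further, sketched here with the standard-library `tracemalloc` module rather than RSS, is to record the peak of Python-tracked allocations while a read loop runs. The helper name `peak_traced_mb` is hypothetical. As far as I know, NumPy reports its array buffers to `tracemalloc`, but memory allocated internally by compiled codec libraries would likely not be counted, so this is a complement to the RSS measurement and not a replacement.

```python
import tracemalloc


def peak_traced_mb(fn, *args, **kwargs):
    """Run fn and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 ** 2)


# Hypothetical usage with the array from the question (requires zarr):
#   def read_all(data, batch_size=10):
#       for i in range(0, data.shape[0], batch_size):
#           batch = data[i:i + batch_size]
#
#   _, peak = peak_traced_mb(read_all, root['data'])
#   print(f"peak traced memory: {peak:.2f} MiB")
```

If the peak stays near the size of a few chunks instead of approaching the full 15.3 MiB, that would suggest decoded chunks are not being retained between reads.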
