IanDunn

Reputation: 21

How to access an entire zarr array without loading the entire array into memory?

I am using zarr to store a large, chunked array. Let's say the store and zarr array are created like so:

import zarr
import numpy as np

store = zarr.storage.LocalStore('mydata.zarr')
zarr_data = zarr.create_array(
    store=store,
    name='data',
    shape=(40000, 100), 
    chunks=(1000, 100), 
    dtype='f4')

zarr_data[:] = np.random.randn(40000, 100)

I have a separate python program that will access the entire array, but not all at once; rather, it will retrieve data in batches.

store = zarr.storage.LocalStore('mydata.zarr')
root = zarr.open(store=store, mode='r')

for i in range(root['data'].shape[0]):
    batch = root['data'][i]

My question is: if I were to run this code, would the entire dataset eventually end up loaded into memory? If so, is there a way to avoid this? In my application, the full zarr array will be too large to fit in memory. I think the ideal solution would be a way to release previously read chunks from memory once I no longer need them. Is there a way to do this?

Experiments

I ran some code similar to the above scenario and measured the memory usage. As you can see, the zarr array contains 16 MB of data, but the process memory usage increases by only about 2 MB over the course of iterating over the zarr array. Here is the code and a visualization of the memory usage:

Code I ran:

import os

import psutil
import zarr

def current_memory():
    # Resident set size (RSS) of the current process, in MiB
    process = psutil.Process(os.getpid())
    memory_usage = process.memory_info().rss
    memory_usage_mb = memory_usage / (1024 ** 2)
    return memory_usage_mb

init_mem = current_memory()
mem_measurements = []

store = zarr.storage.LocalStore('./memtest.zarr')
root = zarr.open(store=store, mode='r')

print(root['data'].info)

batch_size = 10
# Step through the rows in batches; the range covers the final batch too
for i in range(0, root['data'].shape[0], batch_size):
    batch = root['data'][i:i+batch_size]
    mem_measurements.append(current_memory() - init_mem)

Output of print(root['data'].info):

Type               : Array
Zarr format        : 3
Data type          : DataType.float32
Shape              : (40000, 100)
Chunk shape        : (1000, 100)
Order              : C
Read-only          : False
Store type         : LocalStore
Filters            : ()
Serializer         : BytesCodec(endian=<Endian.little: 'little'>)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 16000000 (15.3M)
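
As a sanity check on the numbers in this output, the byte counts follow directly from the shape, chunk shape, and dtype. A small sketch of the arithmetic (the variable names are just for illustration):

```python
shape = (40000, 100)
chunks = (1000, 100)
itemsize = 4  # bytes per float32 element

# Uncompressed size of the whole array and of a single chunk
total_bytes = shape[0] * shape[1] * itemsize
chunk_bytes = chunks[0] * chunks[1] * itemsize

# Number of chunks (the shape divides evenly by the chunk shape here)
n_chunks = (shape[0] // chunks[0]) * (shape[1] // chunks[1])

print(total_bytes)                       # 16000000
print(round(total_bytes / 1024**2, 1))   # 15.3
print(chunk_bytes)                       # 400000
print(n_chunks)                          # 40
```

So each chunk holds roughly 0.38 MiB of uncompressed data, and the array consists of 40 such chunks.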

Measured memory usage: [plot of process memory usage over batches]

Closing thoughts

It seems that zarr doesn't load the entire array into memory when iterating over it. The next question is: why is that? Does zarr cache anything or keep chunks in memory?

Upvotes: 1

Views: 64

Answers (1)

IanDunn

Reputation: 21

On every access to a zarr array, zarr reads, in full, each chunk "touched" by the user-specified selection. It returns the requested selection and then "throws away" (discards from memory) the bytes it originally read. So if you loop over a zarr array as in my example, you never hold the entire array in memory; only the chunks that the current access touches are in memory. If each slice is confined to one chunk, only one chunk is held in memory at a time.
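
To keep every read confined to a single chunk, the batches can be aligned to the chunk boundaries along the first axis. Below is a minimal sketch of that idea, not an official zarr recipe. The helper names `chunk_aligned_slices` and `iter_chunk_batches` are made up for illustration and are not part of zarr; the only thing assumed about the array is that it exposes `.shape` and `.chunks` and supports slicing, as zarr arrays do.

```python
from typing import Iterator


def chunk_aligned_slices(n_rows: int, chunk_rows: int) -> Iterator[slice]:
    """Yield row slices that each fall inside exactly one chunk along axis 0."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, n_rows, chunk_rows):
        # The last slice is clipped in case n_rows is not a multiple of chunk_rows
        yield slice(start, min(start + chunk_rows, n_rows))


def iter_chunk_batches(arr):
    """Yield one chunk-sized batch at a time from an array-like object.

    `arr` is assumed to have `.shape`, `.chunks`, and slicing support.
    """
    for rows in chunk_aligned_slices(arr.shape[0], arr.chunks[0]):
        yield arr[rows]


# Hypothetical usage with the array from the question:
#
#     data = zarr.open(store=store, mode='r')['data']
#     for batch in iter_chunk_batches(data):
#         ...  # work on `batch`, then let it go out of scope
```

With the (1000, 100) chunks from the question, this yields 40 batches of 1000 rows each. Each returned batch is an ordinary in-memory array, so it is freed once nothing references it any more, for example when the loop variable is rebound on the next iteration.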

Upvotes: 1
