Peio Ziarsolo

Reputation: 51

Dask array to zarr with unknown shapes

I am trying to store a dask array in a zarr file.

I have managed to do it when the dask array has a defined shape.


import dask
import dask.array as da
import numpy as np
from tempfile import TemporaryDirectory
import zarr


np_array = np.random.randint(1, 10, size=1000)
array = da.from_array(np_array)

with TemporaryDirectory() as tmpdir:
    delayed = da.to_zarr(array, url=tmpdir,
                         compute=False, component='/data')
    dask.compute(delayed)

    z_object = zarr.open_group(tmpdir, mode='r')

    assert np.all(np_array == z_object.data[:])

However, once I apply an operation whose result size depends on the data (boolean indexing here), the shape is lost and zarr complains about the NaNs in the shape.

# this will fail

np_array = np.random.randint(1, 10, size=1000)
array = da.from_array(np_array)

array = array[array > 5]

with TemporaryDirectory() as tmpdir:
    delayed = da.to_zarr(array, url=tmpdir,
                         compute=False, component='/data')
    dask.compute(delayed)

    z_object = zarr.open_group(tmpdir, mode='r')

    assert np.all(np_array[np_array > 5] == z_object.data[:])

This is the raised error:

Traceback (most recent call last):
  File "/home/peio/devel/variation/variation6/variation6/tests/test_zarr.py", line 38, in <module>
    without_shape()
  File "/home/peio/devel/variation/variation6/variation6/tests/test_zarr.py", line 29, in without_shape
    compute=False, component='/data')
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/dask/array/core.py", line 2808, in to_zarr
    **kwargs
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/creation.py", line 120, in create
    chunk_store=chunk_store, filters=filters, object_codec=object_codec)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/storage.py", line 323, in init_array
    object_codec=object_codec)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/storage.py", line 343, in _init_array_metadata
    shape = normalize_shape(shape) + dtype.shape
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/util.py", line 58, in normalize_shape
    shape = tuple(int(s) for s in shape)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/util.py", line 58, in <genexpr>
    shape = tuple(int(s) for s in shape)
ValueError: cannot convert float NaN to integer

Is there a way to store a dask array with an unknown shape in a zarr file?

Thanks in advance!

Upvotes: 2

Views: 1049

Answers (1)

jakirkham

Reputation: 715

Zarr expects chunk shapes to be uniform and known beforehand. Dask currently facilitates this by rechunking the array to be uniform. However, array[array > 5] creates a Dask Array with unknown chunk shapes, because the number of elements that pass the filter is not known until the data is computed. So there is no way to rechunk it to be uniform beforehand, as the needed information is not present. That said, we could explain this better.
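To see where the NaNs in the traceback come from, here is a minimal sketch that compares an elementwise operation with boolean indexing. The input data and the chunk size of 250 are arbitrary choices for illustration, not part of the question.

```python
import dask.array as da
import numpy as np

# Illustrative input: values 0..9 repeated, split into four chunks of 250.
np_array = np.arange(1000) % 10
array = da.from_array(np_array, chunks=250)

# Elementwise operations keep the shape, so the chunk sizes stay known.
shifted = array + 1

# Boolean indexing keeps a data-dependent number of elements per chunk,
# so Dask records each chunk length (and the total length) as NaN.
filtered = array[array > 5]

print(shifted.shape, shifted.chunks)
print(filtered.shape, filtered.chunks)
```

Nothing is computed here; the NaNs are only placeholders in the array metadata, and those placeholders are what zarr fails to convert to integers.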

One could work around this by using Dask operations that return known chunk shapes (as David suggests). Alternatively, one could determine the chunk shapes before storing (at the cost of computing them). We could also discuss extending Zarr to handle this case, but that is a longer-term solution.
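The second option (determining the chunk shapes before storing) might look roughly like the sketch below. It assumes a Dask version that provides Array.compute_chunk_sizes(); the helper names with_known_chunks and store_with_known_chunks and the chunk size of 64 are hypothetical, chosen only for this example. The explicit rechunk is there because the filtered chunks generally end up with different lengths, while Zarr wants uniform ones.

```python
import dask.array as da
import numpy as np


def with_known_chunks(array, chunk_size):
    """Resolve NaN chunk sizes, then rechunk to a uniform size (hypothetical helper)."""
    # compute_chunk_sizes() evaluates the graph to learn each chunk's length.
    # This is the "cost of computing" mentioned above.
    array = array.compute_chunk_sizes()
    # The resolved chunks are usually uneven, so make them uniform
    # (the last chunk may be shorter).
    return array.rechunk(chunk_size)


def store_with_known_chunks(array, url, component, chunk_size):
    """Write a Dask array with unknown chunks to Zarr (hypothetical helper)."""
    known = with_known_chunks(array, chunk_size)
    return da.to_zarr(known, url=url, component=component)


np_array = np.arange(1000) % 10
array = da.from_array(np_array, chunks=100)
filtered = array[array > 5]

known = with_known_chunks(filtered, chunk_size=64)
print(known.shape, known.chunks)
```

With the question's setup, store_with_known_chunks(array[array > 5], tmpdir, 'data', 64) would then take the place of the failing da.to_zarr call. Note that the filter is evaluated once to find the sizes and again when the data is written, so persisting the filtered array first may be worthwhile for expensive graphs.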

Upvotes: 2
