Allan Delautre

Reputation: 21

Is there a way to save an xarray dataset to a Zarr store, with the possibility of appending along multiple dimensions?

I'm currently doing an internship where I need to create large datasets, often hundreds of GB in size. I'm collecting temporal samples for cartography: 500 samples for each geographical point. Because of the large memory requirements, I save every 25 samples into separate Zarr files. Once the collection process is complete, I merge all these smaller Zarr files into one large Zarr file to create a single xarray dataset.

This method works, but it requires a separate merging step after the data collection, which is time-consuming. I was wondering if there’s a way to directly append the data to the main all.zarr file during the collection process itself. Ideally, I’d like to save every 25 traces directly into the all.zarr file, using something like:

xrs_index.to_zarr(results_dir / "all.zarr", mode="a", append_dim=["position", "index"])

Is there a way to achieve this, to streamline the process and reduce the overhead?

Here is the code I currently use to merge the smaller Zarr files:

import xarray as xr
from dask.diagnostics import ProgressBar
from tqdm import tqdm

# Sort so files pair up with positions deterministically (glob order is arbitrary).
files_list = sorted(results_dir.glob("results_*.zarr"))
nbfiles = len(files_list)
assert nbfiles % 25 == 0

with ProgressBar():
    for i in range(nbfiles // 25):
        # Open the next batch of 25 files lazily and merge them on shared coordinates.
        files_list_part = files_list[i * 25:(i + 1) * 25]
        xrs = [xr.open_zarr(f).squeeze() for f in tqdm(files_list_part)]
        xrs_index = xr.combine_by_coords(xrs)
        # Add a "position" dimension and append the batch along it.
        xrs_index = xrs_index.expand_dims({"position": [i]})
        xrs_index.to_zarr(results_dir / "all.zarr", mode="a", append_dim="position")

Upvotes: 2

Views: 366

Answers (1)

mdurant

Reputation: 28684

You should be able to use zarr directly to set up your output group/zarray. Since zarr assumes "nan" (or other fill value) for parts of the array that contain no data, you can make your output array arbitrarily large in any dimension(s), and then write to it, filling in the parts where you get data. In your case, it sounds like you know the eventual total size of the dataset.
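For instance, a minimal sketch of that idea using the zarr v2 API (the array name "traces", the 40 x 500 shape, and the chunking are made-up placeholders):

import numpy as np
import zarr

# Open (or create) the output group on disk.
root = zarr.open_group("all.zarr", mode="a")

# Declare the full eventual extent up front; chunks that were never
# written read back as the fill value (NaN here).
z = root.create_dataset(
    "traces",
    shape=(40, 500),    # e.g. 40 positions x 500 samples
    chunks=(1, 25),     # one 25-sample batch per chunk
    dtype="f8",
    fill_value=np.nan,
)

# As each batch of data arrives, write it into its region.
z[0, 0:25] = np.random.rand(25)   # position 0, samples 0-24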

To bootstrap what the set of arrays will look like, you could do to_zarr on the first component and see what you get.

first_data.to_zarr("results_0.zarr")

creates a directory containing a .zgroup file and subdirectories for each variable (including coordinates), each with its own .zarray metadata file.

You can then either use zarr's API to recreate an empty dataset with the same structure, or edit the .zarray files in the directories you want to change and set the "shape" attribute. At write time, you will need mode="r+", meaning "update data files, don't change metadata".
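A hedged sketch of that write step, assuming all.zarr already has its full shape in the metadata and that "position" is the dimension being filled in (names taken from the question; depending on your xarray version, coordinate variables that don't overlap the region may need to be dropped first):

# Write batch i into its pre-allocated slot without touching metadata.
# mode="r+" updates existing variables only; region= selects the slab
# of the full-size arrays that this dataset should land in.
xrs_index.to_zarr(
    results_dir / "all.zarr",
    mode="r+",
    region={"position": slice(i, i + 1)},
)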

Alternative

You already have your 25 data sets. If these are lazy, you can concatenate/merge them using the xarray API and then do .to_zarr in one shot, allowing xarray to figure things out for you. You probably need dask to schedule your expensive computations for each subset and compute them out-of-core (releasing memory when done).

This is probably the better workflow for future expansion, but requires some learning about how to make xarray datasets from dask delayed calls. https://docs.xarray.dev/en/latest/user-guide/dask.html
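A rough sketch of that workflow, reusing the names from the question's code and relying on xr.open_zarr returning lazy, dask-backed arrays by default:

import xarray as xr

files_list = sorted(results_dir.glob("results_*.zarr"))

# Build one lazy dataset per position; no data is loaded yet.
parts = []
for i in range(len(files_list) // 25):
    batch = [xr.open_zarr(f).squeeze() for f in files_list[i * 25:(i + 1) * 25]]
    parts.append(xr.combine_by_coords(batch).expand_dims({"position": [i]}))

# Concatenate along "position" and write once; dask streams the chunks
# to disk out-of-core instead of holding everything in memory.
xr.concat(parts, dim="position").to_zarr(results_dir / "all.zarr", mode="w")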

Upvotes: -1
