Allan Delautre

Reputation: 21

Is there a way to save an xarray dataset to a Zarr store, with the possibility of appending along multiple dimensions?

I'm currently doing an internship where I need to create large datasets, often hundreds of GB in size. I'm collecting temporal samples for cartography: 500 samples for each geographical point. Because of the large memory requirements, I save every 25 samples into separate Zarr files. Once the collection process is complete, I merge all these smaller Zarr files into one large Zarr file to create a single xarray dataset.

This method works, but it requires a separate merging step after the data collection, which is time-consuming. I was wondering if there’s a way to directly append the data to the main all.zarr file during the collection process itself. Ideally, I’d like to save every 25 traces directly into the all.zarr file, using something like:

xrs_index.to_zarr(results_dir / "all.zarr", mode="a", append_dim=["position", "index"])

Is there a way to achieve this, to streamline the process and reduce the overhead?

Here is the code I currently use to merge the smaller Zarr files:

import xarray as xr
from dask.diagnostics import ProgressBar
from tqdm import tqdm

# Sort so files pair up with positions deterministically (glob order is arbitrary).
files_list = sorted(results_dir.glob("results_*.zarr"))
nbfiles = len(files_list)
assert nbfiles % 25 == 0

with ProgressBar():
    for i in range(nbfiles // 25):
        # Open the next batch of 25 files lazily and merge them on shared coordinates.
        files_list_part = files_list[i * 25:(i + 1) * 25]
        xrs = [xr.open_zarr(f).squeeze() for f in tqdm(files_list_part)]
        xrs_index = xr.combine_by_coords(xrs)
        # Add a "position" dimension and append the batch along it.
        xrs_index = xrs_index.expand_dims({"position": [i]})
        xrs_index.to_zarr(results_dir / "all.zarr", mode="a", append_dim="position")

Upvotes: 2

Views: 366

Answers (1)

mdurant

Reputation: 28684

You should be able to use zarr directly to set up your output group/zarray. Since zarr assumes "nan" (or other fill value) for parts of the array that contain no data, you can make your output array arbitrarily large in any dimension(s), and then write to it, filling in the parts where you get data. In your case, it sounds like you know the eventual total size of the dataset.
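For instance, a minimal sketch of that idea using the zarr v2 API (the array name "traces", the 40 x 500 shape, and the chunking are made-up placeholders):

import numpy as np
import zarr

# Open (or create) the output group on disk.
root = zarr.open_group("all.zarr", mode="a")

# Declare the full eventual extent up front; chunks that were never
# written read back as the fill value (NaN here).
z = root.create_dataset(
    "traces",
    shape=(40, 500),    # e.g. 40 positions x 500 samples
    chunks=(1, 25),     # one 25-sample batch per chunk
    dtype="f8",
    fill_value=np.nan,
)

# As each batch of data arrives, write it into its region.
z[0, 0:25] = np.random.rand(25)   # position 0, samples 0-24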

To bootstrap what the set of arrays will look like, you could do to_zarr on the first component and see what you get.

first_data.to_zarr("results_0.zarr")

creates a directory containing a .zgroup file and subdirectories for each variable (including coordinates), each with its own .zarray metadata file.

You can then either use zarr's API to recreate an empty dataset with the same structure, or edit the .zarray files in the directories you want to change and set the "shape" attribute. At write time, you will need mode="r+", meaning "update data files, don't change metadata".
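A hedged sketch of that write step, assuming all.zarr already has its full shape in the metadata and that "position" is the dimension being filled in (names taken from the question; depending on your xarray version, coordinate variables that don't overlap the region may need to be dropped first):

# Write batch i into its pre-allocated slot without touching metadata.
# mode="r+" updates existing variables only; region= selects the slab
# of the full-size arrays that this dataset should land in.
xrs_index.to_zarr(
    results_dir / "all.zarr",
    mode="r+",
    region={"position": slice(i, i + 1)},
)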

Alternative

You already have your 25 data sets. If these are lazy, you can concatenate/merge them using the xarray API and then do .to_zarr in one shot, allowing xarray to figure things out for you. You probably need dask to schedule your expensive computations for each subset and compute them out-of-core (releasing memory when done).

This is probably the better workflow for future expansion, but requires some learning about how to make xarray datasets from dask delayed calls. https://docs.xarray.dev/en/latest/user-guide/dask.html
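A rough sketch of that workflow, reusing the names from the question's code and relying on xr.open_zarr returning lazy, dask-backed arrays by default:

import xarray as xr

files_list = sorted(results_dir.glob("results_*.zarr"))

# Build one lazy dataset per position; no data is loaded yet.
parts = []
for i in range(len(files_list) // 25):
    batch = [xr.open_zarr(f).squeeze() for f in files_list[i * 25:(i + 1) * 25]]
    parts.append(xr.combine_by_coords(batch).expand_dims({"position": [i]}))

# Concatenate along "position" and write once; dask streams the chunks
# to disk out-of-core instead of holding everything in memory.
xr.concat(parts, dim="position").to_zarr(results_dir / "all.zarr", mode="w")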

Upvotes: -1
