I have a zarr dataset on disk, which I open with xarray using:
import xarray as xr
import numpy as np
import dask.distributed as dd
# setup dask
cluster = dd.LocalCluster()
client = dd.Client(cluster)
# open dataset
ds = xr.open_zarr('...')
Once open, the dataset looks like this:
Note that the window dimension represents time and the clm dimension represents spectral coefficients.
Now, I want to manipulate this dataset, for example interpolate it into grid space, which I do as follows:
def spec_to_grid_numpy(f_clm):
    # first fold the clm coefficients
    f_clm_folded = ...
    # then apply the Legendre transform
    leg_f = np.einsum(...)
    # finally apply the HFFT
    f = np.fft.hfft(...)
    return f
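For concreteness, the three steps above can be sketched with made-up sizes. Everything here is hypothetical filler for the elided parts: `num_l`, `num_m`, and the matrix `leg` stand in for the real spectral truncation and the precomputed associated-Legendre matrix.

```python
import numpy as np

# Hypothetical sizes standing in for the real spectral truncation.
num_lat, num_lon = 4, 8
num_l, num_m = 3, 3
rng = np.random.default_rng(0)

# Hypothetical Legendre matrix (latitude, degree); in practice it would be
# precomputed from associated Legendre polynomials at the grid latitudes.
leg = rng.normal(size=(num_lat, num_l))

def spec_to_grid_numpy(f_clm):
    # fold the flat clm axis into a (degree, order) block
    f_clm_folded = f_clm.reshape(f_clm.shape[:-1] + (num_l, num_m))
    # Legendre transform: contract the degree axis, giving (latitude, order)
    leg_f = np.einsum('jl,...lm->...jm', leg, f_clm_folded)
    # Hermitian FFT over the order axis, giving longitude
    return np.fft.hfft(leg_f, n=num_lon, axis=-1)

out = spec_to_grid_numpy(rng.normal(size=(2, num_l * num_m)))
```

Applied to an input of shape `(2, num_l * num_m)`, this returns an array of shape `(2, num_lat, num_lon)`, matching the `output_core_dims` declared below.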
ds_grid = xr.apply_ufunc(
    spec_to_grid_numpy,
    ds,
    input_core_dims=[['clm']],
    output_core_dims=[['latitude', 'longitude']],
    dask='parallelized',
    output_dtypes=['float64'],
    dask_gufunc_kwargs=dict(
        output_sizes=dict(latitude=num_lat, longitude=num_lon),
    ),
)
ds_grid.to_zarr('...')
However, when I run this code, I get the following warning:
/libre/farchia/programs/miniforge3/envs/ptifs/lib/python3.10/site-packages/distributed/client.py:3361: UserWarning: Sending large graph of size 22.31 MiB.
This may cause some slowdown.
Consider loading the data with Dask directly or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.
warnings.warn(
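One common cause of this warning is very fine chunking: each chunk contributes at least one task, so the graph size grows with the number of chunks. A small sketch on a synthetic dataset (sizes and chunk lengths are hypothetical) shows the effect:

```python
import numpy as np
import xarray as xr

# Synthetic numpy-backed stand-in for the dataset (sizes are hypothetical).
ds_np = xr.Dataset({'f': (('window', 'clm'), np.zeros((100, 9)))})

# Chunking finely along `window` puts one task per chunk into the graph;
# coarser chunks mean fewer tasks, hence a smaller graph to send.
fine = ds_np.chunk({'window': 1})
coarse = ds_np.chunk({'window': 25})
print(len(fine['f'].data.__dask_graph__()))
print(len(coarse['f'].data.__dask_graph__()))
```

The second graph is much smaller than the first, which is the kind of reduction that can bring the graph back under the warning threshold.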
This raises several questions. What happens if the dataset grows (for example along the window dimension)? Will dask still be able to handle the dataset? Or will I have to manually split the dataset into several subsets along the window dimension, and process each subset independently?
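The manual-splitting option can be sketched on a synthetic dataset (sizes and the batch length are hypothetical); each subset could then be processed and written out independently, e.g. with `to_zarr(..., append_dim='window')`:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the real dataset, with hypothetical sizes.
ds = xr.Dataset({'f': (('window', 'clm'), np.zeros((20, 9)))})

batch = 5  # hypothetical number of windows per batch
pieces = []
for start in range(0, ds.sizes['window'], batch):
    sub = ds.isel(window=slice(start, start + batch))
    # ... apply the spectral transform to `sub` here ...
    pieces.append(sub)

# recombine (or write each piece to zarr instead of concatenating)
out = xr.concat(pieces, dim='window')
```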
I have plotted the task graph for a subset of the data: