Reputation: 335
I open data from the ERA5 Google Cloud Zarr archive and do some preprocessing (changing the time resolution, selecting the Northern Hemisphere only, etc.), with all operations applied lazily to the dask-backed data.
This is what the xarray DataArray looks like:
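Sketched, the setup is roughly the following; the store path, variable name, chunk sizes, and resampling frequency below are illustrative placeholders rather than my exact code:

```python
import xarray as xr

# Open the ERA5 Zarr archive on Google Cloud lazily, backed by dask.
# The store path and chunking here are placeholders for illustration.
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/<store>.zarr",
    chunks={"time": 24},
    consolidated=True,
)

# Preprocessing, all lazy on dask data: coarsen the time resolution
# and keep the Northern Hemisphere only (ERA5 latitude runs 90 -> -90).
da = (
    ds["2m_temperature"]
    .resample(time="6h")
    .mean()
    .sel(latitude=slice(90, 0))
)
print(da)  # dask-backed DataArray with dims (time, latitude, longitude)
```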
Then I apply a numpy function that works on multi-dimensional data using xr.apply_ufunc:
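The call looks roughly like this; the body of _to_lon_lat is omitted, and the wrapper name apply_to_lon_lat is mine, just for this sketch:

```python
import numpy as np
import xarray as xr

def _to_lon_lat(arr: np.ndarray) -> np.ndarray:
    # Stand-in for my actual function; it operates on the trailing
    # (latitude, longitude) axes of a multi-dimensional numpy array.
    return arr  # placeholder body, the real function does real work

def apply_to_lon_lat(da: xr.DataArray) -> xr.DataArray:
    # Map _to_lon_lat over the (latitude, longitude) core dimensions.
    # dask="parallelized" makes it run chunk-wise on dask-backed input
    # and is simply ignored for plain numpy input.
    return xr.apply_ufunc(
        _to_lon_lat,
        da,
        input_core_dims=[["latitude", "longitude"]],
        output_core_dims=[["latitude", "longitude"]],
        dask="parallelized",
        output_dtypes=[da.dtype],
    )
```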
I tested two scenarios: one where the data is loaded into memory first and the ufunc is applied to numpy data, and one where the ufunc is applied to dask data and the result is computed afterwards.
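Option 1, sketched with the apply_to_lon_lat wrapper from above:

```python
# Scenario 1: pull the data into memory, then apply the ufunc
# to the resulting numpy-backed DataArray.
da_in_memory = da.load()
result = apply_to_lon_lat(da_in_memory)
```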
This is fast.
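Option 2, with the same wrapper:

```python
# Scenario 2: apply the ufunc lazily to the dask-backed DataArray,
# then trigger the actual computation at the end.
result = apply_to_lon_lat(da).compute()
```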
This is slow.
I'd have thought that dask="parallelized" distributes the work across multiple cores. Is numpy doing that as well? Why is applying the ufunc to numpy data still so much faster?

As soon as the dataset becomes large, it will be infeasible to load the data into memory first (option 1). Therefore, I want to make sure that I'm not doing something stupid that makes dask="parallelized" very slow.
Thanks.
Upvotes: 0
Views: 683
Reputation: 1031
If the array is small enough to fit comfortably into memory, the overhead of spinning up dask workers, scheduling tasks across them, and the general cost of splitting the data, computing the pieces, and stitching them back together makes dask a lot slower. Dask helps you scale up; it isn't good for speeding up array computations that fit completely into memory.
Whether what you are doing is stupid really depends on what the function _to_lon_lat does. In general, though, if what that function does can be expressed with native xarray functions, then you shouldn't use apply_ufunc but those xarray functions instead, since all of them work with dask out of the box.
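For example, if _to_lon_lat were essentially just averaging over space (a made-up stand-in, since the real function isn't shown), the native version needs no apply_ufunc at all; this assumes a DataArray da with latitude and longitude dimensions:

```python
import numpy as np
import xarray as xr

# Native xarray reduction: works identically on numpy-backed and
# dask-backed DataArrays, no wrapper needed.
spatial_mean = da.mean(dim=["latitude", "longitude"])

# The equivalent apply_ufunc version adds wrapper code for no gain:
spatial_mean_ufunc = xr.apply_ufunc(
    lambda a: np.nanmean(a, axis=(-2, -1)),
    da,
    input_core_dims=[["latitude", "longitude"]],
    dask="parallelized",
    output_dtypes=[da.dtype],
)
```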
Upvotes: 1