Marcelo Andrioni
Marcelo Andrioni

Reputation: 436

In python xarray, how do I create and subset lazy variables without loading the whole dataarray?

I am trying to create a python function that opens a remote dataset (in a opendap server) using xarray and automatically creates new variables lazily. A use case would be to calculate magnitude and direction when u and v components are available, e.g.:

import xarray as xr
import dask
import dask.array as da


@dask.delayed
def uv2mag(u, v):
    return (u**2 + v**2)**0.5


@dask.delayed
def uv2dir(u, v):
    return np.rad2deg(np.arctan2(u, v))


def open_dataset(*args, **kwargs) -> xr.Dataset:

    uv = kwargs.pop("uv", None)

    ds = xr.open_dataset(*args, **kwargs)

    if uv:

        uvar, vvar = uv

        ds["magnitude"] = (
            ds[uvar].dims,
            da.from_delayed(uv2mag(ds[uvar], ds[vvar]),
                            ds[uvar].shape,
                            dtype=ds[uvar].dtype),
            {"long_name": "magnitude"},
        )

        ds["direction"] = (
            ds[uvar].dims,
            da.from_delayed(uv2dir(ds[uvar], ds[vvar]),
                            ds[uvar].shape,
                            dtype=ds[uvar].dtype),
            {"long_name": "direction"},
        )

    return ds


url = "https://tds.hycom.org/thredds/dodsC/FMRC_ESPC-D-V02_uv3z/FMRC_ESPC-D-V02_uv3z_best.ncd"
uvar = "water_u"
vvar = "water_v"

ds = open_dataset(url, drop_variables="tau", uv=[uvar, vvar])

On first look everything seems to work just fine and the new magnitude and direction variables are created.

<xarray.Dataset>
Dimensions:      (depth: 40, lat: 4251, lon: 4500, time: 121)
Coordinates:
  * depth        (depth) float64 0.0 2.0 4.0 6.0 ... 2.5e+03 3e+03 4e+03 5e+03
  * lat          (lat) float64 -80.0 -79.96 -79.92 -79.88 ... 89.92 89.96 90.0
  * lon          (lon) float64 0.0 0.07996 0.16 0.24 ... 359.7 359.8 359.8 359.9
  * time         (time) datetime64[ns] 2024-09-29T12:00:00 ... 2024-10-14T12:...
    time_run     (time) datetime64[ns] ...
Data variables:
    time_offset  (time) datetime64[ns] ...
    water_u      (time, depth, lat, lon) float32 ...
    water_v      (time, depth, lat, lon) float32 ...
    magnitude    (time, depth, lat, lon) float32 dask.array<chunksize=(121, 40, 4251, 4500), meta=np.ndarray>
    direction    (time, depth, lat, lon) float32 dask.array<chunksize=(121, 40, 4251, 4500), meta=np.ndarray>
Attributes: (12/22)
    classification_level:      
    distribution_statement:    Approved for public release; distribution unli...
    downgrade_date:            not applicable
    classification_authority:  not applicable
    institution:               Fleet Numerical Meteorology and Oceanography C...
    source:                    HYCOM archive file, GLBz0.04
                       ...
    time_origin:               2024-10-05 12:00:00
    _CoordSysBuilder:          ucar.nc2.dataset.conv.CF1Convention
    cdm_data_type:             GRID
    featureType:               GRID
    location:                  Proto fmrc:FMRC_ESPC-D-V02_uv3z
    history:                   FMRC Best Dataset

The problem arises when I try to get a very small subset of the data. A "real" variable returns the data instantly, e.g.:

>>> ds.isel(time=slice(0, 5), depth=0, lat=1000, lon=1000)["water_u"].values
array([0.27400002, 0.23600002, 0.12      , 0.108     , 0.24400002],
      dtype=float32)

But when I try to get a subset from the lazy variable, an error is raised due to trying to load the whole 4D matrix in memory, not just the subset.

>>> ds.isel(time=slice(0, 5), depth=0, lat=1000, lon=1000)["magnitude"].values
MemoryError: Unable to allocate 345. GiB for an array with shape (121, 40, 4251, 4500) and data type float32

Does anyone have a suggestion of how thin can be fixed?

Thank you.

My expectation was that the subset performed by isel would also be applied in the lazy variables, and that the loading in memory would be equally as fast as the "real" variables.

Upvotes: 1

Views: 86

Answers (1)

Marcelo Andrioni
Marcelo Andrioni

Reputation: 436

Following @guillaume-eb suggestion, I removed the dask.delayed option and used only simple Dask Arrays. The code seems to work just fine now. Thank you.

import datetime
import numpy as np
import xarray as xr
import dask.array as dask_array


def uv2mag(u: xr.DataArray,
           v: xr.DataArray
           ) -> dask_array:
    return (u.data**2 + v.data**2)**0.5


def uv2dir(u: xr.DataArray,
           v: xr.DataArray
           ) -> dask_array:
    return np.rad2deg(np.arctan2(u.data, v.data))


def open_dataset(*args, **kwargs) -> xr.Dataset:

    uv = kwargs.pop("uv", None)

    ds = xr.open_dataset(*args, **kwargs)

    if uv:

        uvar, vvar = uv

        ds["magnitude"] = (
            ds[uvar].dims,
            uv2mag(ds[uvar], ds[vvar]),
            {"long_name": "magnitude"},
        )

        ds["direction"] = (
            ds[uvar].dims,
            uv2dir(ds[uvar], ds[vvar]),
            {"long_name": "direction"},
        )

    return ds


url = "https://tds.hycom.org/thredds/dodsC/FMRC_ESPC-D-V02_uv3z/FMRC_ESPC-D-V02_uv3z_best.ncd"
uvar = "water_u"
vvar = "water_v"

ds = open_dataset(url, drop_variables="tau", chunks=10, uv=[uvar, vvar])
ds2 = ds.isel(time=slice(0, 5), depth=0, lat=1000, lon=1000)
ds2.load()

Upvotes: 1

Related Questions