pbreach

Reputation: 16987

Subsetting xarray.Dataset with respect to multiple coordinates

Say I have an xarray.Dataset object, loaded with xarray.open_dataset(..., decode_times=False), that looks like this when printed:

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 15, lon: 34, plev: 8, time: 3650)
Coordinates:
  * time       (time) float64 3.322e+04 3.322e+04 3.322e+04 3.322e+04 ...
  * plev       (plev) float64 1e+05 8.5e+04 7e+04 5e+04 2.5e+04 1e+04 5e+03 ...
  * lat        (lat) float64 40.46 43.25 46.04 48.84 51.63 54.42 57.21 60.0 ...
  * lon        (lon) float64 216.6 219.4 222.2 225.0 227.8 230.6 233.4 236.2 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) float64 3.322e+04 3.322e+04 3.322e+04 3.322e+04 ...
    lat_bnds   (lat, bnds) float64 39.07 41.86 41.86 44.65 44.65 47.44 47.44 ...
    lon_bnds   (lon, bnds) float64 215.2 218.0 218.0 220.8 220.8 223.6 223.6 ...
    hus        (time, plev, lat, lon) float64 0.006508 0.007438 0.008751 ...

What would be the best way to subset this given multiple ranges for lat, lon, and time? I've tried chaining a series of conditions and used xarray.Dataset.where, but I get an error saying:

IndexError: The indexing operation you are attempting to perform is not valid on netCDF4.Variable object. Try loading your data into memory first by calling .load().

I can't load the entire dataset into memory, so what would be the typical way to do this?

Upvotes: 4

Views: 6980

Answers (1)

shoyer

Reputation: 9603

NetCDF4 doesn't support all of the multi-dimensional indexing operations supported by NumPy, but it does support slicing (which is very fast) and one-dimensional indexing (somewhat slower).

Some things to try:

  • Index with slices (e.g., .sel(time=slice(start, end))) before indexing with 1-dimensional arrays. This should offload the array-based indexing from netCDF4 to Dask/NumPy.
  • Split up your indexing operations into more intermediate operations that index along fewer dimensions at once. It sounds like you've already tried this one, but maybe it's worth exploring a little more.
  • To optimize performance, try different Dask chunking schemes using .chunk().
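
As a rough illustration of the first two suggestions, here is a minimal sketch that slices down to a bounding box first and then applies 1-D array indexing one dimension at a time. It uses a small in-memory dataset as a stand-in for the one in the question; the sizes, the ranges, and the names `box`, `lat_keep`, and `time_keep` are all made up for the example, and whether this avoids the IndexError on a real netCDF file depends on your xarray and netCDF4 versions.

```python
import numpy as np
import xarray as xr

# Small in-memory stand-in for the dataset in the question (sizes are made up).
time = np.arange(33220.0, 33250.0)    # 30 undecoded time steps
lat = np.linspace(40.0, 60.0, 11)     # 40, 42, ..., 60
lon = np.linspace(216.0, 236.0, 11)   # 216, 218, ..., 236
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"hus": (("time", "lat", "lon"), rng.random((time.size, lat.size, lon.size)))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Step 1: slices only, covering the bounding box of every range of interest.
box = ds.sel(time=slice(33225, 33244), lat=slice(42, 58), lon=slice(220, 232))

# Step 2: pick the wanted ranges with 1-D integer indexers, one dimension per call.
# The masks are built from the coordinate values only, not from the data variable.
lat_keep = ((box.lat >= 42) & (box.lat <= 46)) | ((box.lat >= 54) & (box.lat <= 58))
subset = box.isel(lat=np.flatnonzero(lat_keep.values))

time_keep = (subset.time <= 33229) | (subset.time >= 33240)
subset = subset.isel(time=np.flatnonzero(time_keep.values))

print(subset.sizes)
```

With a file opened through xarray.open_dataset, the same calls would apply to the lazily loaded dataset; passing chunks=... to open_dataset or calling .chunk() before step 2 (both require Dask) is one way to try the third suggestion.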

If that doesn't work, post a full self-contained example to the xarray issue tracker on GitHub and we can take a look into it in more detail.

Upvotes: 3
