Reputation: 21

slow performance using xarray.DataArray.quantile() on large dataset

I am using xarray in Python (Spyder) to read large NetCDF files and process them.

import xarray as xr
ds = xr.open_dataset('my_file.nc')

ds has the following dimensions and variables:

<xarray.Dataset>
Dimensions:    (time: 62215, points: 2195)
Coordinates:
  * time       (time) datetime64[ns] 1980-04-01 ... 2021-09-30T21:00:00
Dimensions without coordinates: points
Data variables:
    longitude  (time, points) float32 ...
    latitude   (time, points) float32 ...
    hs         (time, points) float32 ...

I want to calculate the 95th percentile of the variable hs at each point and add the result to the dataset as a new variable:
hs_95 (points) float32

I do this with one line of code:

ds['hs_95'] = ds.hs.quantile(0.95, dim='time')

where ds.hs is an xr.DataArray.

But it takes a very long time to run. Is there anything I can do to make it run faster? Is xarray the most convenient tool for this application?

Upvotes: 2

Answers (2)

Michael Delgado

Reputation: 15452

Migrating my comment into an answer...

xarray loads data from netCDFs lazily, only reading in the parts of the data which are requested for an operation. So the first time you work with the data, you'll be getting the read time + the quantile time. The quantiling may still be slow, but for a real benchmark you should first load the dataset with xr.Dataset.load(), e.g.:

ds = ds.load()

or alternatively, you can load the data and close the file object in one step with xr.load_dataset(filepath).
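To see how much of the wall time is reading and how much is the quantile itself, the two steps can be timed separately. The following is only a rough sketch: time_load_and_quantile is a hypothetical helper name, and the small random dataset is a stand-in for the real file (with a dataset opened from disk, the load step would include the actual read).

```python
import time

import numpy as np
import xarray as xr


def time_load_and_quantile(ds, var="hs", q=0.95, dim="time"):
    """Return (load_seconds, quantile_seconds, result) for one variable of ds."""
    t0 = time.perf_counter()
    ds = ds.load()  # pulls lazily-read data into memory; cheap if already loaded
    t1 = time.perf_counter()
    result = ds[var].quantile(q, dim=dim)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, result


# Synthetic stand-in for the real dataset (assumption: same dimension names,
# much smaller sizes). Replace with xr.open_dataset(...) on the real file.
rng = np.random.default_rng(0)
demo = xr.Dataset(
    {"hs": (("time", "points"), rng.random((1000, 20), dtype=np.float32))}
)

load_s, quantile_s, hs_95 = time_load_and_quantile(demo)
print(f"load: {load_s:.4f} s, quantile: {quantile_s:.4f} s")
```

If the load time dominates, the bottleneck is I/O rather than the quantile computation.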

That said, you should definitely heed @tekiz's great advice to use skipna=False if you can - the performance improvement can be on the order of 100x if you don't have to skip NaNs when quantiling (if you're sure you don't have NaNs).
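A minimal sketch of that idea, assuming it is acceptable to scan the array once for NaNs first: skipna=False is used only when no NaNs are found. The random array here is a stand-in for ds.hs, and the actual speedup will depend on the data and the xarray/numpy versions.

```python
import numpy as np
import xarray as xr

# Stand-in for ds.hs (assumption: same dimension names, smaller sizes).
rng = np.random.default_rng(0)
hs = xr.DataArray(rng.random((2000, 50)), dims=("time", "points"), name="hs")

# One pass over the data to find out whether NaN handling is needed at all.
has_nan = bool(hs.isnull().any())

# With skipna=False, any point whose time series contains a NaN would come
# back as NaN, so only disable NaN skipping when the data is known to be clean.
hs_95 = hs.quantile(0.95, dim="time", skipna=has_nan)

# Sanity check on the stand-in data: both code paths agree when there are no NaNs.
reference = hs.quantile(0.95, dim="time", skipna=True)
np.testing.assert_allclose(hs_95.values, reference.values)
```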

Upvotes: 1

tearis

Reputation: 53

Can you try passing skipna=False to the xarray.DataArray.quantile() method? This could help a bit.

Upvotes: 1
