Tong Qiu

Reputation: 153

Read a subset of a single NetCDF file using slices of dimensions

I have a very big NetCDF file.

I tried to use the dask.array support in the Python xarray module and specified the chunk size when I opened the data. That worked fine; however, when I tried to load the variables into memory using .load(), it was very slow.
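Roughly what I did (the file path, variable name, and chunk sizes below are placeholders):

import xarray as xr

# open lazily with dask chunks, then pull one variable into memory
ds = xr.open_dataset('big_file.nc', chunks={'lat': 100, 'lon': 100})
tmax = ds['tmax'].load()  # this step is very slow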

I wonder whether there is any option (in xarray or another Python module) to read in a subset of a NetCDF file by providing indices for the dimensions (lat, lon). That way I could apply functions directly to the subset without using dask.array.

Upvotes: 3

Views: 2534

Answers (2)

shoyer

Reputation: 9603

This issue sounds similar to those discussed in https://github.com/pydata/xarray/issues/1396, but if you're using a recent version of dask, that problem should be resolved.

You can potentially improve performance by avoiding explicit chunking until after indexing, e.g., just

import xarray as xr

# open without explicit chunks, index first, then compute
tmax = xr.open_mfdataset(terra_climate_dir + 'tmax.nc')
tmax_pos = tmax.sel(lat=39.9042, lon=116.4074, method='nearest').compute()

If this doesn't help, then the issue may be related to your source data. For example, queries may be slow if the data is accessed over a network-mounted drive, or if it is loaded from netCDF4 files with in-file chunking/compression (which requires reading full chunks into memory).
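If you suspect in-file chunking or compression, one way to check (assuming a variable named 'tmax') is to inspect the variable's encoding after opening the file:

import xarray as xr

ds = xr.open_dataset('tmax.nc')
# netCDF4 storage details end up in .encoding;
# look for 'chunksizes', 'zlib', and 'complevel'
print(ds['tmax'].encoding)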

Upvotes: 1

Keisuke FUJII

Reputation: 1406

You can slice the data before loading the variable into memory.

import xarray as xr

ds = xr.open_dataset('path/to/file')
# isel is lazy, so only the requested slice is read from disk
in_memory = ds.isel(x=slice(10, 1000)).load()
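The same pattern applies to your lat/lon case; the index ranges below are just an example:

subset = ds.isel(lat=slice(100, 200), lon=slice(300, 400)).load()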

Upvotes: 3
