Reputation: 153
I have a very big NetCDF file.
I tried the dask.array feature in the python xarray module and specified the chunk size when I opened the data. That worked fine; however, when I tried to load the variables into memory using .load(), it was very slow.
I wonder whether there is any option (in xarray or another python module) to read in a subset of a NetCDF file by providing indices of the dimensions (lat, lon)? That way I could apply functions directly to the subset without using dask.array.
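For reference, this is roughly what I'm doing now (a minimal sketch; the file name tmax.nc, the variable name, and the chunk sizes are just placeholders):

import xarray as xr

# open lazily with dask chunks, then try to pull everything into memory
ds = xr.open_dataset('tmax.nc', chunks={'lat': 100, 'lon': 100})
tmax = ds['tmax'].load()   # this step is very slow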
Upvotes: 3
Views: 2534
Reputation: 9603
This issue sounds similar to those discussed in https://github.com/pydata/xarray/issues/1396, but if you're using recent versions of dask that problem should be resolved.
You can potentially improve performance by avoiding explicit chunking until after indexing, e.g., just
import xarray as xr

tmax = xr.open_mfdataset(terra_climate_dir + 'tmax.nc')  # open lazily, no explicit chunks
tmax_pos = tmax.sel(lat=39.9042, lon=116.4074, method='nearest').compute()  # load only the selected point
If this doesn't help, then the issue may be related to your source data. For example, queries may be slow if the data is accessed over a network-mounted drive, or if the data is stored in netCDF4 files with in-file chunking/compression (which requires reading full chunks into memory).
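If you suspect the latter, you can inspect how the file is stored with the netCDF4 library (a quick sketch; replace the file and variable names with yours):

from netCDF4 import Dataset

nc = Dataset('tmax.nc')
var = nc.variables['tmax']
print(var.chunking())   # 'contiguous' or a list of chunk sizes per dimension
print(var.filters())    # shows whether compression (e.g. zlib) is enabled
nc.close()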
Upvotes: 1
Reputation: 1406
You can slice the data before loading the variable into memory.
import xarray as xr

ds = xr.open_dataset('path/to/file')
in_memory = ds.isel(x=slice(10, 1000)).load()  # only this slice is read into memory
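In your case the same pattern works with integer indices on the lat and lon dimensions (a sketch; the dimension names, index ranges, and the variable name tmax are assumptions to adjust to your file):

subset = ds.isel(lat=slice(100, 200), lon=slice(300, 400)).load()
result = subset['tmax'].mean(dim='lon')  # functions now run on plain in-memory arrays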
Upvotes: 3