Reputation: 13
for fname in ids['fnames']:
aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
aq = aq[var_lists]
aq = aq.isel(lat=slice(yoff, yoff+ysize), lon=slice(xoff, xoff+xsize))
list_of_ds.append(aq)
aq.close()
all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')
Hi all, I am making use of xarray to read netcdf files (around 1000) and save selected resutls to a temporary file, as shown above. However, the saving part runs very slow. How can I speed this up?
I also tried directly load the data, but still very slow.
I've also tried using open_mfdataset
with parallel=True
, and it's also slow:
aq = xr.open_mfdataset(
sorted(ids_list),
data_vars=var_lists,
preprocess=add_time_dim,
combine='by_coords',
mask_and_scale=False,
decode_cf=False,
parallel=True,
)
aq.isel({'lon':irlon,'lat':irlat}).to_netcdf('tmp.nc')
Upvotes: 1
Views: 1581
Reputation: 15452
Unfortunately, concatenating ~1000 files in xarray will be slow. Not a great way around that.
It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:
xr.open_mfdataset
. Your second code block looks great. dask will generally be faster and more efficient at managing tasks than you will with a for loop.chunks={"lat": 50, "lon": 50}
. You'll want to balance a few things here - making sure the chunk sizes are manageable and not too small (leading to too many tasks). Shoot for chunks ~100-500 MB range as a general rule, and trying to keep the number of tasks to less than 1 million (or # chunks to fewer than ~10-100k across all your datasets).combine='nested'
performs better than 'by_coords'
, so if you're concatenating files which are structured logically along one or more dimensions, it may help to arrange the files in the same way a dim is provided.combine='by_coords'
, where the coords are the result of an earlier dask operation. If you need to attach a time dim to each file, with 1 element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested")
works well in my experience.If this is all taking too long, you could try pre-processing the data. Using a compiled utility like nco
or even just subsetting the data and grouping smaller subsets of the data into larger files using dask.distributed
's client.map might help cut down on the complexity of the final dataset join.
Upvotes: 2