Lianghai Wu

Reputation: 13

python xarray write to netcdf file very slow

import xarray as xr

list_of_ds = []
for fname in ids['fnames']:
    # lazily open each file and keep only the variables of interest
    aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
    aq = aq[var_lists]
    # keep only the spatial window of interest
    aq = aq.isel(lat=slice(yoff, yoff+ysize), lon=slice(xoff, xoff+xsize))
    list_of_ds.append(aq)
    # note: the file has to stay open until the lazy data are written below

all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')

Hi all, I am using xarray to read netCDF files (around 1000) and to save selected results to a temporary file, as shown above. However, the saving part runs very slowly. How can I speed it up?

I also tried loading the data directly, but it is still very slow.
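
That attempt looked roughly like this (a simplified sketch rather than my exact code; the names are the same as in the snippet above):

for fname in ids['fnames']:
    # open without dask and pull the subset into memory immediately
    aq = xr.open_dataset(fname, mask_and_scale=False)
    aq = aq[var_lists].isel(lat=slice(yoff, yoff+ysize),
                            lon=slice(xoff, xoff+xsize)).load()
    list_of_ds.append(aq)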

I've also tried using open_mfdataset with parallel=True, and it's also slow:

aq = xr.open_mfdataset(
    sorted(ids_list),
    data_vars=var_lists,
    preprocess=add_time_dim,
    combine='by_coords',
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)

aq.isel({'lon':irlon,'lat':irlat}).to_netcdf('tmp.nc')

Upvotes: 1

Views: 1581

Answers (1)

Michael Delgado

Reputation: 15452

Unfortunately, concatenating ~1000 files in xarray will be slow; there's not a great way around that.

It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:

  • Use xr.open_mfdataset. Your second code block looks great; dask will generally be faster and more efficient at managing the tasks than a hand-rolled for loop.
  • Make sure your chunks are aligned with how you're slicing the data. You don't want to read in more than you have to. If you're reading netCDFs, you have flexibility in how the data are read into dask. Since you're selecting (it looks like) a small spatial region within each array, it may make sense to explicitly chunk the data so that you only read a small portion of each array, e.g. with chunks={"lat": 50, "lon": 50}. You'll want to balance a few things here: keep the chunk sizes manageable but not too small (which leads to too many tasks). As a general rule, shoot for chunks in the ~100-500 MB range, and try to keep the number of tasks under about 1 million (or the number of chunks under roughly 10-100k across all your datasets).
  • Be explicit about your concatenation. The more "magic" the process feels, the more work xarray is doing to infer what you mean. Generally, combine='nested' performs better than 'by_coords', so if your files are structured logically along one or more dimensions, it helps to arrange the file list in that same order and provide the dimension(s) explicitly.
  • Skip the pre-processing. If you can, add new dimensions on concatenation rather than as an ingestion step. This allows dask to plan the computation more fully, rather than treating your preprocess function as a black box - and, worse, as a prerequisite to scheduling the final array construction, because with combine='by_coords' the coords are themselves the result of an earlier dask operation. If you need to attach a time dim to each file, with one element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested") works well in my experience (see the sketch after this list, which puts these pieces together).
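
Putting those pieces together, a minimal sketch might look like the following. The file list, variable list, slice bounds, and the daily date range are assumptions standing in for your actual values:

import pandas as pd
import xarray as xr

files = sorted(ids_list)  # assumed: one file per time step, already in order

ds = xr.open_mfdataset(
    files,
    combine='nested',
    # explicit time coordinate, one element per file, instead of a preprocess step
    concat_dim=pd.Index(
        pd.date_range('2020-01-01', freq='D', periods=len(files)), name='time'
    ),
    data_vars=var_lists,
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
    # chunk the spatial dims so only the sliced region is ever read
    chunks={'lat': 50, 'lon': 50},
)

ds.isel(lat=slice(yoff, yoff + ysize),
        lon=slice(xoff, xoff + xsize)).to_netcdf('tmp.nc')

With explicit chunks and an explicit concat dimension, dask can plan the whole read-subset-write pipeline up front instead of first computing coordinates from each file.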

If this is all taking too long, you could try pre-processing the data first. Using a compiled utility like nco, or simply subsetting the data and grouping smaller subsets into larger intermediate files with dask.distributed's client.map, might help cut down on the complexity of the final dataset join.
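
For example, a rough sketch of the client.map approach (the batch size and output names are illustrative assumptions; ids_list, var_lists, yoff, ysize, xoff, and xsize are the names from your snippets):

import xarray as xr
from dask.distributed import Client


def subset_batch(args):
    """Subset one batch of files spatially and write a single intermediate file."""
    i, batch = args
    parts = []
    for fname in batch:
        # open each file eagerly (no dask) inside the worker, subset, then load
        with xr.open_dataset(fname, mask_and_scale=False, decode_cf=False) as ds:
            parts.append(
                ds[var_lists]
                .isel(lat=slice(yoff, yoff + ysize), lon=slice(xoff, xoff + xsize))
                .load()
            )
    out = f'subset_{i:04d}.nc'
    xr.concat(parts, dim='time').to_netcdf(out)
    return out


if __name__ == '__main__':
    client = Client()  # local cluster; point this at your scheduler if you have one
    files = sorted(ids_list)
    batches = [files[i:i + 50] for i in range(0, len(files), 50)]
    intermediate = client.gather(client.map(subset_batch, list(enumerate(batches))))
    # the handful of pre-subset files is then cheap to combine in a single pass
    xr.open_mfdataset(intermediate, combine='nested', concat_dim='time').to_netcdf('tmp.nc')

Each worker reads and writes only its own small subset, so the final join only has to stitch together a couple dozen already-subset files.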

Upvotes: 2
