Reputation: 47
I am trying to perform a fairly simple operation on a dataset: editing variable and global attributes on individual netCDF files of 3.5 GB each. The files load instantly using xr.open_dataset, but dataset.to_netcdf() is too slow to export after the modifications.
I have tried:
- load() before to_netcdf
- persist() or compute() before to_netcdf
I am working on an HPC with 10 distributed workers. In all cases, the time taken is more than 15 minutes per file. Is this expected? What else can I try to speed up this process, apart from further parallelizing the single-file operations using dask.delayed?
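For context, here is a minimal sketch of what I am doing (the file names, the chunking, and the variable/attribute names are placeholders):

```python
import xarray as xr

# Open lazily; only the metadata is read at this point
ds = xr.open_dataset("input_file.nc", chunks={"time": 100})

# Edit global and per-variable attributes
ds.attrs["history"] = "edited attributes"
ds["temperature"].attrs["units"] = "K"  # placeholder variable name

# This is the slow step: more than 15 minutes per 3.5 GB file
ds.to_netcdf("output_file.nc")
```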
Upvotes: 0
Views: 1648
Reputation: 28683
First, a quick note:
The files load instantly using xr.open_dataset
You probably did not actually load the data at this point, only the metadata. Depending on your I/O and compression/encoding, it might take considerable CPU time and memory to load the data. You should have an idea of how long it ought to take with a single CPU thread.
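For example, you can time the actual read on its own to see how much of the 15 minutes is just I/O and decompression (a minimal sketch; the file path is a placeholder):

```python
import time
import xarray as xr

# open_dataset is lazy: this returns almost immediately
ds = xr.open_dataset("input_file.nc")

# Forcing the load is where the real cost shows up
t0 = time.time()
ds.load()
print(f"Loading took {time.time() - t0:.1f} s")
```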
To answer your question: netCDF (HDF5) does not play nicely with parallel writing. You will likely find that only one task is writing at a time because of locking, or even that all the output data is funneled to a single task before writing, regardless of your chunking. Please check your dask dashboard!
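If you are using a distributed cluster, the dashboard address is available from the client (a minimal sketch; adapt the connection to however you start your scheduler on the HPC):

```python
from dask.distributed import Client

client = Client()  # or Client("tcp://scheduler-address:8786") for an existing scheduler
print(client.dashboard_link)  # open this URL and watch the write/store tasks
```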
May I recommend that you try the zarr format, which works well for parallel applications because each chunk is stored in a separate file. You still need to make decisions about the correct chunking of your data (example).
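A rough sketch of what that could look like (the chunk size, the dimension name "time", and the paths are placeholders you would tune for your data):

```python
import xarray as xr

# Open with explicit dask chunks so each worker gets a piece to write
ds = xr.open_dataset("input_file.nc", chunks={"time": 100})

# Attribute edits as before
ds.attrs["history"] = "edited attributes"

# Each chunk goes to its own file, so workers can write in parallel
ds.to_zarr("output_file.zarr", mode="w")
```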
Upvotes: 1