I am aiming to merge annual NetCDF files into a single NetCDF file for future use in a model. The model requires multi-year data supplied to it at the start of its run, so having a single file with multi-year data is a necessity.
I am working in Oracle Cloud Infrastructure (OCI) in a Python notebook on their 'Data Science' resource. I have stored all of the NetCDFs required in a storage bucket and due to their large size, it is infeasible for me to download them to a local machine and perform this step locally.
Here is an MRE of my code, assuming the NetCDFs are at the given file paths (and with the caveat that this can't actually be reproduced outside of the OCI environment):
import os
import xarray as xr

def join_annual_netcdfs(netcdf_paths_dict, output_dir):
    # Loop through all variable, model and scenario names from dict
    for var, models in netcdf_paths_dict.items():
        for model, scenarios in models.items():
            for scenario, file_paths in scenarios.items():
                # Open multiple files and join by coords (lazy loading)
                ds = xr.open_mfdataset(file_paths, combine='by_coords')
                # Setup output filepath
                output_path = os.path.join(output_dir, f'combined_{var}_{model}_{scenario}.nc')
                # Save joined dataset
                ds.to_netcdf(output_path)
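For reference, netcdf_paths_dict is nested by variable, then model, then scenario, each mapping to a list of object paths. The bucket, namespace and file names below are purely illustrative placeholders for the assumed shape:

netcdf_paths_dict = {
    'tas': {
        'ModelA': {
            'ssp245': [
                'oci://my-bucket@my-namespace/tas/ModelA/ssp245/2015.nc',
                'oci://my-bucket@my-namespace/tas/ModelA/ssp245/2016.nc',
            ],
        },
    },
}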
The first function above is how I would like this to work. As written, the error returned relates to connecting to the storage bucket, which can be fixed using fsspec:
with fsspec.open(path, mode='rb', oci_kwargs={'namespace': namespace}) as f:
    ds = xr.open_dataset(f)
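For completeness, the fs object used in the failing loop below is an fsspec filesystem handle for OCI Object Storage. A minimal sketch of how it might be created, assuming the ocifs package is installed and glossing over the auth kwargs, which depend on your environment:

import fsspec

# The 'oci' protocol is provided by the ocifs package (assumed installed in the
# notebook session); auth arguments are omitted here and depend on your setup,
# e.g. a config file or the notebook's resource principal.
fs = fsspec.filesystem('oci')

# Objects can then be opened as oci://<bucket>@<namespace>/<object-name>
with fs.open('oci://my-bucket@my-namespace/tas/ModelA/ssp245/2015.nc', mode='rb') as f:
    ds = xr.open_dataset(f)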
The fsspec approach successfully opens a single NetCDF file. However, when I try to loop with it and merge the datasets afterwards, I (predictably) run into an I/O error, because each file connection is closed on leaving its 'with' block. e.g.
def join_annual_netcdfs(netcdf_paths_dict, output_dir):
    # Loop through all variable, model and scenario names from dict
    for var, models in netcdf_paths_dict.items():
        for model, scenarios in models.items():
            for scenario, file_paths in scenarios.items():
                datasets = []
                # Open each dataset and append to list
                for file in file_paths:
                    with fs.open(file, mode='rb') as f:
                        ds = xr.open_dataset(f)
                        datasets.append(ds)
                # Merge datasets
                merged_ds = xr.merge(datasets)
                # Setup output filepath
                output_path = os.path.join(output_dir, f'combined_{var}_{model}_{scenario}.nc')
                # Save joined dataset
                merged_ds.to_netcdf(output_path)
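To spell out the failure mode: xr.open_dataset reads lazily by default, so the actual data access only happens at merge/write time, after the underlying handle has already been closed. A stripped-down sketch of the same problem:

with fs.open(file_paths[0], mode='rb') as f:
    ds = xr.open_dataset(f)  # lazy: no data is actually read yet
ds.load()  # I/O error here, because the file handle has already been closed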
Can anyone see a way around this, given that I can only hold one connection to the bucket open at a time inside a 'with' block, but need the data from multiple files available simultaneously?