Bowhaven

Reputation: 395

How to read multiple NetCDF files from Oracle bucket storage using Python SDK in a Data Science notebook?

I am aiming to merge annual NetCDF files into a single NetCDF file for future use in a model. The model requires multi-year data supplied to it at the start of its run, so having a single file with multi-year data is a necessity.

I am working in Oracle Cloud Infrastructure (OCI) in a Python notebook on their 'Data Science' resource. I have stored all of the required NetCDFs in an Object Storage bucket, and because of their size it is infeasible for me to download them to a local machine and perform this step locally.
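
For context, netcdf_paths_dict (used in the code below) is a nested dictionary keyed by variable, model and scenario, with each leaf holding the list of annual file paths in the bucket. A minimal sketch of its shape; the variable, model, scenario names and paths here are purely illustrative:

netcdf_paths_dict = {
    'tas': {                                  # variable
        'MODEL-A': {                          # model
            'ssp245': [                       # scenario -> list of annual files
                'oci://my-bucket@my-namespace/tas/MODEL-A/ssp245/tas_2015.nc',
                'oci://my-bucket@my-namespace/tas/MODEL-A/ssp245/tas_2016.nc',
            ],
        },
    },
}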

Here is an MRE of my code, on the assumption that the NetCDFs are at the given file paths, and accepting that it can't actually be reproduced outside of the OCI environment:

import os
import xarray as xr

def join_annual_netcdfs(netcdf_paths_dict, output_dir):

    # Loop through all variable, model and scenario names from dict
    for var, models in netcdf_paths_dict.items():
        for model, scenarios in models.items():
            for scenario, file_paths in scenarios.items():

                # Open multiple files and join by coords (lazy loading)
                ds = xr.open_mfdataset(file_paths, combine='by_coords')

                # Setup output filepath
                output_path = os.path.join(output_dir, f'combined_{var}_{model}_{scenario}.nc')

                # Save joined dataset
                ds.to_netcdf(output_path)

The above code is what I would like to get working. As it stands, the error returned relates to connecting to the storage bucket, and that can be fixed using 'fsspec':

import fsspec

with fsspec.open(path, mode='rb', oci_kwargs={'namespace': namespace}) as f:
    ds = xr.open_dataset(f)

This successfully opens a single NetCDF file. However, when I loop with this approach and then try to merge the datasets afterwards, I (predictably) run into an I/O error, because each file connection is closed on leaving its 'with' block, e.g.:

def join_annual_netcdfs(netcdf_paths_dict, output_dir):

    # Loop through all variable, model and scenario names from dict
    for var, models in netcdf_paths_dict.items():
        for model, scenarios in models.items():
            for scenario, file_paths in scenarios.items():

                datasets = []
                # Open each dataset and append to list
                for file_path in file_paths:
                    with fs.open(file_path, mode='rb') as f:
                        ds = xr.open_dataset(f)
                        datasets.append(ds)

                # Merge datasets
                merged_ds = xr.merge(datasets)

                # Setup output filepath
                output_path = os.path.join(output_dir, f'combined_{var}_{model}_{scenario}.nc')

                # Save joined dataset
                merged_ds.to_netcdf(output_path)
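
For completeness, fs in the loop above is the fsspec/ocifs filesystem object pointing at my bucket. I create it roughly like this; the config file path is an assumption, and inside the Data Science notebook the authentication may instead go through resource principal:

import ocifs

# Sketch only: the config path below is a placeholder for my own OCI credentials
fs = ocifs.OCIFileSystem(config='~/.oci/config')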

Can anyone see a way around being limited to one open connection to the bucket at a time, given that I need multiple files open simultaneously in order to merge them?

Upvotes: 0

Views: 54

Answers (0)
