Omi

Reputation: 79

How to combine 'variables' from multiple NetCDF files into a single NetCDF file?

I am working with ocean currents data generated using ROMS model output in NetCDF format. Each NetCDF file contains the monthly mean of ocean currents velocity i.e., only one time step. So far I've reached here.

import netCDF4
import pandas as pd
import numpy as np
import xarray as xr
df1 = xr.open_dataset("ocean_avg_November.nc4") #NetCDF file for Nov 2013
df2 = xr.open_dataset("ocean_avg_December.nc4") #NetCDF file for Dec 2013
du1 = df1['u'] # x sea water velocity
dv1 = df1['v'] # y sea water velocity
dw1 = df1['w'] # upward sea water velocity
du2 = df2['u']
dv2 = df2['v']
dw2 = df2['w']

Now, how do I combine du1, dv1 and dw1 with du2, dv2 and dw2 to create a single NetCDF file containing a time series of ocean current data, i.e., two time steps here, November and December? Do I need to use xarray.merge, xarray.concat, or some other function? I'm new to xarray and Python. Any help in solving this issue is appreciated.

Upvotes: 1

Views: 2403

Answers (1)

Michael Delgado

Reputation: 15452

xarray has great documentation on combining data, and I highly recommend giving it a close read! But if you're just getting started, it can be confusing to work out which operation to use. Also, if you have specific feedback on which parts of the documentation you found confusing, I'm sure the xarray devs would love to hear it (especially if you're willing to contribute to the docs yourself)!

There are generally four ways to combine data. Directly from the docs:

  • For combining datasets or data arrays along a single dimension, see concatenate.
  • For combining datasets with different variables, see merge.
  • For combining datasets or data arrays with different indexes or missing values, see combine.
  • For combining datasets or data arrays along multiple dimensions see combining along multiple dimensions.

From your question, it looks like you have two datasets which are distinct only in the month of data represented. Other than the time component, it sounds like the two datasets are the same, each with u, v, and w variables, and with the dimensions of these variables consistent between the two Datasets with the exception of the time dimension. Because of this, this seems like a perfect use case for concatenate. Concatenation just means joining two arrays together by placing them next to each other along a single axis to form a single, larger array. When you concatenate datasets, xarray automatically concatenates each array within the dataset.

Merge is more appropriate if you have two datasets that are similar in all of their dimensions, but differ in which variables are present. For example, if you had three datasets, all of which had the same dims, but one had the u variable, the second had v, and the third had w, then we would combine these variables into one larger dataset with three variables (and the same dims) using merge.
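To make the difference concrete, here is a minimal sketch using small synthetic datasets. The helper make_month, the grid size, and the random values are all made up for illustration and are not taken from the ROMS files in the question.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_month(date, names=("u", "v", "w")):
    """Build a tiny dataset with a single time step (hypothetical stand-in for one file)."""
    time = pd.to_datetime([date])
    data = {name: (("time", "y", "x"), np.random.rand(1, 2, 3)) for name in names}
    return xr.Dataset(data, coords={"time": time})


nov = make_month("2013-11-01")
dec = make_month("2013-12-01")

# Same variables, different time steps: concatenate along "time".
combined = xr.concat([nov, dec], dim="time")

# Same dims, different variables: merge into one dataset.
only_u = nov[["u"]]
only_v = nov[["v"]]
merged = xr.merge([only_u, only_v])

print(combined.sizes["time"], list(merged.data_vars))
```

After concatenation the time dimension has grown, while after merging the dimensions are unchanged and only the set of variables is larger. Either result can then be written to disk with Dataset.to_netcdf, for example combined.to_netcdf("ocean_avg_combined.nc") (the output file name here is a placeholder).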

Now that we know which approach to take, we're ready to start concatenating. The actual implementation will depend a bit on whether the data has a time dimension, with each file having only one value along this dimension, or if there's no time dimension at all.

If the concatenation dim is already present in the data

If the time dimension is already present, this is very easy - all we need to do is tell xarray to concatenate along time.

Using the data you've already read in, we can use xr.concat to combine along any single dim:

# I'm using the more standard variable names "ds" to avoid confusion 
# with pandas DataFrames, but these refer to df1 and df2 in your question
ds_merged = xr.concat([ds1, ds2], dim="time")

Alternately, you could concatenate the arrays as you read them in, by using xr.open_mfdataset. The syntax is similar:

fps = ["ocean_avg_November.nc4", "ocean_avg_December.nc4"]
# recent xarray versions require combine="nested" whenever concat_dim is given
ds = xr.open_mfdataset(fps, combine="nested", concat_dim="time")

If the concatenation dim is not present

If your data does not yet have a time dimension, we'll need to tell xarray how to differentiate between the two arrays in time. We can do this in a couple of ways. You could expand the dimensionality of the arrays first, using xr.Dataset.expand_dims, e.g. ds1.expand_dims(time=pd.to_datetime(['2013-11-01'])), and the same for ds2, and then concatenate the datasets as above. Wrapping the date in pd.to_datetime gives the new coordinate a datetime dtype rather than plain strings. This makes it very clear what's going on, but it may be slightly slower, since each array is reshaped once by expand_dims and again by the concatenation.
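A minimal sketch of the expand_dims route, assuming two synthetic in-memory datasets (ds1 and ds2 below are made-up stand-ins for the real files, with no time dimension):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-ins for two monthly files without a time dimension.
ds1 = xr.Dataset({"u": (("y", "x"), np.zeros((2, 3)))})
ds2 = xr.Dataset({"u": (("y", "x"), np.ones((2, 3)))})

# Give each dataset a length-1 "time" dimension with a datetime coordinate.
ds1_t = ds1.expand_dims(time=pd.to_datetime(["2013-11-01"]))
ds2_t = ds2.expand_dims(time=pd.to_datetime(["2013-12-01"]))

# Now "time" exists in both, so we can concatenate along it by name.
ds = xr.concat([ds1_t, ds2_t], dim="time")
print(ds["u"].dims, ds["u"].shape)
```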

A faster option is to define your dimension as you concatenate. To do this, we'll create a pandas DatetimeIndex object manually with pd.DatetimeIndex, which will form the new dimension. (pd.to_datetime does not accept a name argument, so we use the DatetimeIndex constructor to set the dimension name.)

new_dimension = pd.DatetimeIndex(["2013-11-01", "2013-12-01"], name='time')
ds = xr.concat([ds1, ds2], dim=new_dimension)
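For reference, a self-contained sketch of defining the dimension during concatenation. The datasets are synthetic placeholders, and the index is built with pd.DatetimeIndex, which accepts a name argument that xarray uses as the name of the new dimension:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-ins for two monthly files without a time dimension.
ds1 = xr.Dataset({"u": (("y", "x"), np.zeros((2, 3)))})
ds2 = xr.Dataset({"u": (("y", "x"), np.ones((2, 3)))})

# A named DatetimeIndex; its name becomes the name of the new dimension.
new_dimension = pd.DatetimeIndex(["2013-11-01", "2013-12-01"], name="time")

# xr.concat accepts a pandas Index as `dim` and uses it as the new coordinate.
ds = xr.concat([ds1, ds2], dim=new_dimension)
print(ds["u"].dims)
```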

Similarly, we can use the DatetimeIndex as we read in the data:

ds = xr.open_mfdataset(fps, combine="nested", concat_dim=new_dimension)

When doing this, we do need to be careful to make sure that the order of the datasets (or filepaths) is consistent with the order of the dates in the new dimension, because we're manually pairing them.
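One way to reduce the risk of a mismatch is to derive the dates from the file names and sort the two together. The sketch below assumes the "ocean_avg_<Month>.nc4" naming from the question, an English locale for parsing month names, and a year of 2013 (the file names themselves do not carry the year); month_start is a hypothetical helper.

```python
from datetime import datetime

# File names in an arbitrary order, following the pattern from the question.
fps = ["ocean_avg_December.nc4", "ocean_avg_November.nc4"]
year = 2013  # assumed; not encoded in the file names


def month_start(path):
    """Parse the month name from the file name and return the first day of that month."""
    month_name = path.rsplit("_", 1)[1].split(".")[0]
    month = datetime.strptime(month_name, "%B").month
    return datetime(year, month, 1)


# Sort (date, path) pairs together so the two lists cannot drift apart.
pairs = sorted((month_start(fp), fp) for fp in fps)
dates = [d for d, _ in pairs]
ordered_fps = [fp for _, fp in pairs]
print(ordered_fps)
```

The dates list can then be turned into the new dimension with pd.DatetimeIndex(dates, name="time") and used alongside ordered_fps.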

If you want only a subset of the variables in each dataset

The above methods will work for either a single variable (or DataArray), or for all of the arrays in a dataset (xarray will apply the combine rules to all variables and coordinates automatically).

If you're trying to concatenate only some of the available variables (let's say each file had variables u, v, w, x, y, and z), you could filter them using the above methods ahead of time, or when reading them.

Using xr.concat:

ds = xr.concat([ds1[["u", "v", "w"]], ds2[["u", "v", "w"]]], dim="time")
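The same selection can be written once and applied to every dataset. This is a sketch with synthetic data; the helper make_month and the extra variable "temp" are hypothetical and only there to show a variable being left out.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_month(date):
    """Build a tiny one-time-step dataset with an extra variable we do not want."""
    time = pd.to_datetime([date])
    names = ["u", "v", "w", "temp"]  # "temp" is a made-up extra variable
    data = {n: (("time", "y", "x"), np.random.rand(1, 2, 3)) for n in names}
    return xr.Dataset(data, coords={"time": time})


wanted = ["u", "v", "w"]
datasets = [make_month("2013-11-01"), make_month("2013-12-01")]

# Indexing a Dataset with a list of names returns a Dataset with only those variables.
ds = xr.concat([d[wanted] for d in datasets], dim="time")
print(list(ds.data_vars))
```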

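or, when reading with xr.open_mfdataset, by selecting the variables in a preprocess function. (The data_vars argument only controls which variables are concatenated; it does not drop the others.)

ds = xr.open_mfdataset(fps, preprocess=lambda d: d[["u", "v", "w"]], combine="nested", concat_dim="time")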

Upvotes: 4
