xarray Dataset swap_dims for multi-dimensional coord

Question

I want to swap two dimension coordinates for a multi-dimensional coordinate so that I can group by time.month and subtract another dataset.

import pandas as pd
import xarray as xr

ds = xr.Dataset()

# DataArray indexed by 'init_time' and an offset, 'tau'
ds['tst'] = xr.DataArray(
    [[0, 1, 2], [3, 4, 5]],
    dims=('init_time', 'tau'),
    coords={
        'init_time': pd.date_range('2017-01-01', periods=2),
        'tau': pd.to_timedelta([1, 2, 3], unit='days')})

# multidimensional coordinate 'time'
ds.coords['time'] = ds['init_time'] + ds['tau']

# what I would like to do (not a valid call: swap_dims maps one dimension to one coordinate)
ds.swap_dims({('init_time', 'tau'): 'time'})

ds

The result should look something like the output of this pandas version:

clim = pd.Series([2], index=[1]).rename_axis('month')
df = ds.to_dataframe().reset_index()
df['month'] = df['time'].dt.month
df = (
    pd.DataFrame(
        df.set_index(['init_time', 'tau', 'time', 'month'])['tst']
        - clim))

df

Michael Delgado · Accepted Answer

The issue is that swapping dims would produce duplicate values in the index: the same valid time can be reached from different (init_time, tau) pairs, so 'time' is not unique. Ideally, you'd be able to do a groupby on a multi-dimensional coordinate directly. You can currently do this, but it's not fully featured (for example, you can't do ds.groupby('time.month').mean(dim='time')). Better support looks like it may be in the works (see xarray issues #324 and #2525).
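To make the duplicate-index problem concrete, here is a minimal sketch using the dataset from the question. The names flat_time and n_duplicates are only illustrative; the check simply flattens the 2-D 'time' coordinate and counts repeated values.

```python
import pandas as pd
import xarray as xr

# same toy dataset as in the question
ds = xr.Dataset()
ds['tst'] = xr.DataArray(
    [[0, 1, 2], [3, 4, 5]],
    dims=('init_time', 'tau'),
    coords={
        'init_time': pd.date_range('2017-01-01', periods=2),
        'tau': pd.to_timedelta([1, 2, 3], unit='days')})
ds.coords['time'] = ds['init_time'] + ds['tau']

# flatten the 2-D 'time' coordinate and count values that occur more than once
flat_time = pd.Series(ds['time'].values.ravel())
n_duplicates = int(flat_time.duplicated().sum())
print(n_duplicates)  # 2: Jan 3 and Jan 4 are each reached by two (init_time, tau) pairs
```

Because 'time' has 6 entries but only 4 distinct values, it cannot serve as a unique one-dimensional index in place of (init_time, tau).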

Right now, I think you have two options. You could do this in pandas:

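df = ds.to_dataframe().reset_index()
# group by the calendar month of 'time'; if the data has other dimensions
# you want to keep, add those columns to the groupby list as well
monthly_mean = (
    df
    .groupby(df.time.dt.month)[['tst']]
    .mean()
    .to_xarray())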

clim = xr.DataArray([2], dims=['month'], coords=[[1]])

anom = monthly_mean.rename({'time': 'month'}) - clim

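Alternatively, you could keep it in xarray by stacking init_time and tau into a single dimension. The session below assumes numpy is imported as np, and its output comes from a larger test dataset than the one in the question (150 observations instead of 6):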

In [35]: stacked = ds.stack(obs=('init_time', 'tau'))

In [36]: stacked.coords['obs_num'] = ('obs', ), np.arange(len(stacked.obs))

In [37]: stacked.coords['time'] = ('obs', ), stacked.init_time + stacked.tau

In [38]: swapped = stacked.swap_dims({'obs': 'obs_num'})

In [39]: swapped
Out[39]:
<xarray.Dataset>
Dimensions:  (obs_num: 150)
Coordinates:
    time     (obs_num) datetime64[ns] 2017-01-01 2017-01-02 ... 2017-02-03
    obs      (obs_num) object (Timestamp('2017-01-01 00:00:00', freq='D'), Timedelta('0 days 00:00:00')) ... (Timestamp('2017-01-30 00:00:00', freq='D'), Timedelta('4 days 00:00:00'))
  * obs_num  (obs_num) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149
Data variables:
    tst      (obs_num) int64 0 1 2 3 4 5 6 7 ... 142 143 144 145 146 147 148 149

In [47]: swapped.groupby(swapped.time.dt.month).mean(dim='obs_num')
Out[47]:
<xarray.Dataset>
Dimensions:  (month: 2)
Coordinates:
  * month    (month) int64 1 2
Data variables:
    tst      (month) float64 71.56 145.0
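For reference, here is a self-contained sketch of the stacking approach applied to the small dataset from the question. It groups directly on the stacked dimension rather than swapping to obs_num first, which is an assumption that may not hold on every xarray version. The final anomaly step uses a vectorized .sel lookup on the month of each valid time; that is a substitution for groupby arithmetic and is not part of the original answer. The clim values are the ones used in the question.

```python
import pandas as pd
import xarray as xr

# same toy dataset as in the question
ds = xr.Dataset()
ds['tst'] = xr.DataArray(
    [[0, 1, 2], [3, 4, 5]],
    dims=('init_time', 'tau'),
    coords={
        'init_time': pd.date_range('2017-01-01', periods=2),
        'tau': pd.to_timedelta([1, 2, 3], unit='days')})
ds.coords['time'] = ds['init_time'] + ds['tau']

# collapse (init_time, tau) into one 'obs' dimension; the 2-D 'time'
# coordinate becomes 1-D along 'obs', so repeated valid times are harmless
stacked = ds.stack(obs=('init_time', 'tau'))

# group the flattened observations by calendar month and average them
monthly_mean = stacked['tst'].groupby(stacked['time'].dt.month).mean()
print(monthly_mean)

# climatology indexed by month (values taken from the question)
clim = xr.DataArray([2], dims=['month'], coords={'month': [1]})

# look up the climatology for the month of every valid time (vectorized
# indexing with a 2-D indexer) and subtract it, keeping (init_time, tau)
anom = ds['tst'] - clim.sel(month=ds['time'].dt.month)
print(anom)
```

All six valid times fall in January here, so there is a single month group and every value has the same climatology subtracted. With data spanning several months, clim would need an entry for each month that occurs.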
