hlee
hlee

Reputation: 343

best way to subtract mean monthly values for each grid in Python xarray

A toy dataset from here:

import numpy as np
import pandas as pd
import seaborn as sns

import xarray as xr

np.random.seed(123)

xr.set_options(display_style="html")

times = pd.date_range("2000-01-01", "2001-12-31", name="time")
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))

base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)

ds = xr.Dataset(
    {
        "tmin": (("time", "location"), tmin_values),
        "tmax": (("time", "location"), tmax_values),
    },
    {"time": times, "location": ["IA", "IN", "IL"]},
)

I know that here I can find how to subtract mean monthly values from a variable in xarray.DataSet() as below:

climatology = ds.groupby("time.month").mean("time")
anomalies = ds.groupby("time.month") - climatology
anomalies.mean("location").to_dataframe()[["tmin", "tmax"]].plot()

Then, can I do the subtraction for each location?

I tried to do it for location-month groups, but the xarray.DataSet.groupby() doesn't allow to pass multiple groups. Then, I tried to make the location-month using xarray.DataSet.stack(), but it only allows to pass dimension; I could extract month values using time.month but they were restored as a new variable, not a dimension. I can use for or xarray.DataSet.apply() for all locations, but it is too slow (I have about 65000 locations).

Expected results or processes are something like:

for each location:
    climatology = ds.groupby("time.month").mean("time")
    anomalies = ds.groupby("time.month") - climatology

A solution within only xarray would be the best, but if it is possible and quite fast using pd.DataFrame() or others, then those solutions are also welcome.

Edit

Here is my current solution using `pd.DataFrame()'

# convert to pd.dataframe
df = ds.to_dataframe()

# get mean monthly values
months = df.index.get_level_values('time').month
df_meanMonths = df.groupby([pd.Grouper(level='location'), months]).mean()

# rename and reindex
df_meanMonths.rename(columns={'tmin': 'tminMM', 'tmax': 'tmaxMM'}, inplace=True)
df_meanMonths.index.set_names('month', level='time', inplace=True)

# merge
df['month'] = df.index.get_level_values('time').month
vars_join = ['tminMM', 'tmaxMM']
join_right = df_meanMonths[vars_join]

# results
df.reset_index().set_index(['location', 'month']).merge(join_right, how='left', left_index=True, right_on=['location', 'month'])

Upvotes: 4

Views: 1043

Answers (1)

Agustín Begue
Agustín Begue

Reputation: 128

I think that you might be looking for is this:

anomalies = xr.apply_ufunc(
    lambda x, mean: x - mean, 
    ds.tmax.groupby('time.month'),
    ds.tmax.groupby('time.month').mean()
).drop('month')

for just the tmax variable (a DataArray) or

anomalies = xr.apply_ufunc(
    lambda x, means: x - means, 
    ds.groupby('time.month'),
    ds.groupby('time.month').mean()
).drop('month')

for all variables in the Dataset

Upvotes: 1

Related Questions