Reputation: 343
A toy dataset from here:
import numpy as np
import pandas as pd
import seaborn as sns
import xarray as xr
np.random.seed(123)
xr.set_options(display_style="html")
times = pd.date_range("2000-01-01", "2001-12-31", name="time")
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))
base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)
ds = xr.Dataset(
{
"tmin": (("time", "location"), tmin_values),
"tmax": (("time", "location"), tmax_values),
},
{"time": times, "location": ["IA", "IN", "IL"]},
)
I know that here I can find how to subtract mean monthly values from a variable in xarray.DataSet()
as below:
climatology = ds.groupby("time.month").mean("time")
anomalies = ds.groupby("time.month") - climatology
anomalies.mean("location").to_dataframe()[["tmin", "tmax"]].plot()
Then, can I do the subtraction for each location?
I tried to do it for location-month groups, but the xarray.DataSet.groupby()
doesn't allow to pass multiple groups. Then, I tried to make the location-month using xarray.DataSet.stack()
, but it only allows to pass dimension; I could extract month values using time.month
but they were restored as a new variable, not a dimension. I can use for
or xarray.DataSet.apply()
for all locations, but it is too slow (I have about 65000 locations).
Expected results or processes are something like:
for each location:
climatology = ds.groupby("time.month").mean("time")
anomalies = ds.groupby("time.month") - climatology
A solution within only xarray
would be the best, but if it is possible and quite fast using pd.DataFrame()
or others, then those solutions are also welcome.
Here is my current solution using `pd.DataFrame()'
# convert to pd.dataframe
df = ds.to_dataframe()
# get mean monthly values
months = df.index.get_level_values('time').month
df_meanMonths = df.groupby([pd.Grouper(level='location'), months]).mean()
# rename and reindex
df_meanMonths.rename(columns={'tmin': 'tminMM', 'tmax': 'tmaxMM'}, inplace=True)
df_meanMonths.index.set_names('month', level='time', inplace=True)
# merge
df['month'] = df.index.get_level_values('time').month
vars_join = ['tminMM', 'tmaxMM']
join_right = df_meanMonths[vars_join]
# results
df.reset_index().set_index(['location', 'month']).merge(join_right, how='left', left_index=True, right_on=['location', 'month'])
Upvotes: 4
Views: 1043
Reputation: 128
I think that you might be looking for is this:
anomalies = xr.apply_ufunc(
lambda x, mean: x - mean,
ds.tmax.groupby('time.month'),
ds.tmax.groupby('time.month').mean()
).drop('month')
for just the tmax variable (a DataArray) or
anomalies = xr.apply_ufunc(
lambda x, means: x - means,
ds.groupby('time.month'),
ds.groupby('time.month').mean()
).drop('month')
for all variables in the Dataset
Upvotes: 1