Reputation: 110
I'm using xarray with data for which I have measurements and errors. I store these along a dimension moment in the dataset with coordinates value and variance. When I compute for example the mean along a dimension I need values and variances to be treated differently as the former should be combined as
mean_values = sum(values)/len(values)
but the latter as
mean_variance = sum(variances**2)/len(variances)
.
Currently I'm doing this by forming two new datasets and concatinating them. This is very ugly, convoluted and not suited to more complex calculations. I would like to be able to do this kind of operation in one step, perhaps by defining a function taking values and variances as input and then broadcasting the dataset dimension moment onto it.
Given a dataset q_lp with dimensions moment, time, position:
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_common_av, q_lp_common_var], dim='moment')
where average_of_squares is defined by
def average_of_squares(data, axis=None):
sums = np.sum(data**2, axis=axis)
if axis:
return sums/np.shape(data)[axis]**2
return sums/len(data)**2
I'm grateful for discussion, ideas, tips and links to examples.
Edit: To clarify, I don't like splitting the DataArray, handling each moment seperately and concatenating them again. I would prefer a possibility to do the following (untested pseudocode for illustration):
def multi_moment_average(mean, variance):
mean = np.average(mean)
variance = np.sum(variance**2)/len(variance)
return mean, variance
q_lp.reduce(multi_moment_average, broadcast='moment', dim='time')
Minimal working example:
import numpy as np
import xarray as xr
def average_of_squares(data, axis=None):
sums = np.sum(data**2, axis=axis)
if axis:
return sums/np.shape(data)[axis]**2
return sums/len(data)**2
times = np.arange(10)
positions = np.array([1, 3, 5])
values = np.ones((len(times), len(positions))) * (2 + np.random.rand())
variance = np.ones((len(times), len(positions))) * np.random.rand()
q_lp = xr.DataArray(np.array([values, variance]),
coords=[['value', 'variance'], times, positions],
dims=['moment', 'time', 'position'])
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')
Upvotes: 1
Views: 801
Reputation: 110
I found a solution that suits my needs, but am still grateful for more suggestions:
groupby can seperate a Dataset or DataArray along a specified dimension, list thereof creates (key, value) tuples and dict of this has essentially the form of a keyword dictionary. See http://xarray.pydata.org/en/stable/groupby.html
My current solution thus looks like this:
import xarray as xr
def function_applier(data, function, split_dimension=None, **function_kwargs):
return xr.concat(
function(
**dict(list(data.groupby(split_dimension))),
**function_kwargs),
dim=split_dimension)
Now I can define functions taking specific coordinates as inputs which can be written to also work for e.g. numpy arrays. (MWE using the specific example of my original question here)
import numpy as np
def average_of_gaussians(val, var, dim=None):
return val.mean(dim), (var ** 2).sum(dim)/var.count(dim)
val = np.random.rand(12).reshape(2,6)
var = 0.1*np.random.rand(12).reshape(2,6)
da = xr.DataArray([val, var],
dims=['moment','time','position'],
coords=[['val','var'],
np.arange(6),
['a','b']])
>>>da
<xarray.DataArray (moment: 2, position: 2, time: 6)>
array([[[0.66233728, 0.71419351, 0.96758741, 0.96949021, 0.94594299,
0.05080628],
[0.44005458, 0.64616657, 0.69865189, 0.84970553, 0.19561433,
0.8529829 ]],
[[0.02209967, 0.02152369, 0.09181031, 0.00223527, 0.01448938,
0.01484197],
[0.05651841, 0.04942305, 0.08250529, 0.04258035, 0.00184209,
0.0957248 ]]])
Coordinates:
* moment (moment) <U3 'val' 'var'
* position (position) <U1 'a' 'b'
* time (time) int32 0 1 2 3 4 5
>>>function_applier(da,
average_of_gaussians,
split_dimension='moment',
dim='time')
<xarray.DataArray (moment: 2, position: 2)>
array([[0.71839295, 0.61386263],
[0.001636 , 0.00390397]])
Coordinates:
* position (position) <U1 'a' 'b'
* moment (moment) object 'val' 'var'
Note the input names equal to the coordinates for average_of_gaussians. The different operation on each variable in one function and the lack of references to xarray within it are the properties I am after.
Upvotes: 0
Reputation: 8530
I think you can write your function in an xarray-friendly way, and then call it on your data. i.e.
def average_of_squares(data, dim=None):
sums = (data ** 2).sum(dim)
return sums/data.count(dim)**2
q_lp_var = q_lp.sel(moment='variance').pipe(average_of_squares, dim='time')
Having them concat-ed in the same DataArray
is fine; it might be a more natural fit for items on a Dataset
, though.
Does that answer your question?
Edit: re the edited question, I think holding the items in a Dataset rather than a DataArray is most coherent with the data structures. It seems like the mean & variance are two different arrays you want aligned on the same indexes, so a Dataset is ideal
Upvotes: 1