Reputation: 110

xarray coordinate-dependent computation

I'm using xarray with data for which I have measurements and errors. I store these along a dimension moment in the dataset, with coordinates value and variance. When I compute, for example, the mean along a dimension, values and variances need to be treated differently: the former should be combined as

mean_values = sum(values)/len(values)

but the latter as

mean_variance = sum(variances**2)/len(variances).

Currently I'm doing this by forming two new datasets and concatenating them. This is very ugly, convoluted and not suited to more complex calculations. I would like to be able to do this kind of operation in one step, perhaps by defining a function that takes values and variances as input and then broadcasting the dataset dimension moment onto it.

Given a dataset q_lp with dimensions moment, time, position:

q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')

where average_of_squares is defined by

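def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    # axis=0 is falsy, so compare against None explicitly
    if axis is not None:
        return sums/np.shape(data)[axis]**2
    # no axis given: the sum runs over all elements
    return sums/np.size(data)**2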

What better ways are there to handle this?
Is it possible to use xr.apply_ufunc and a my_average function to do this in one step and in-place?
Should I not be putting these into one dataset together at all? q_lp is later on combined with other quantities, also with dimensions moment, position and time, into a Dataset.

I'm grateful for discussion, ideas, tips and links to examples.

Edit: To clarify, I don't like splitting the DataArray, handling each moment separately and concatenating them again. I would prefer a possibility to do the following (untested pseudocode for illustration):

def multi_moment_average(mean, variance):
    mean = np.average(mean)
    variance = np.sum(variance**2)/len(variance)
    return mean, variance

q_lp.reduce(multi_moment_average, broadcast='moment', dim='time')

Minimal working example:

import numpy as np
import xarray as xr


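def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    # axis=0 is falsy, so compare against None explicitly
    if axis is not None:
        return sums/np.shape(data)[axis]**2
    # no axis given: the sum runs over all elements
    return sums/np.size(data)**2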


times = np.arange(10)
positions = np.array([1, 3, 5])
values = np.ones((len(times), len(positions))) * (2 + np.random.rand())
variance = np.ones((len(times), len(positions))) * np.random.rand()

q_lp = xr.DataArray(np.array([values, variance]),
                    coords=[['value', 'variance'], times, positions],
                    dims=['moment', 'time', 'position'])

q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')

Upvotes: 1

Answers (2)

darudiith

Reputation: 110

I found a solution that suits my needs, but am still grateful for more suggestions:

groupby can separate a Dataset or DataArray along a specified dimension. Calling list on the result creates (key, value) tuples, and calling dict on that list gives a mapping that can be unpacked as keyword arguments. See http://xarray.pydata.org/en/stable/groupby.html
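To make that chain concrete, here is a minimal sketch on made-up data (the array contents and the names pairs, groups and labels are only for illustration). Depending on the xarray version, each group may or may not keep a length-1 moment dimension, so the sketch does not rely on that.

```python
import numpy as np
import xarray as xr

# Toy array; the numbers are arbitrary and only serve the illustration.
da = xr.DataArray(np.arange(12.0).reshape(2, 2, 3),
                  dims=['moment', 'position', 'time'],
                  coords={'moment': ['val', 'var']})

# Iterating over a groupby yields (label, sub-array) pairs ...
pairs = list(da.groupby('moment'))
# ... so dict() maps each label along 'moment' to its part of the data.
groups = dict(pairs)

# The keys are the coordinate labels, which is what makes
# **groups usable as keyword arguments later on.
labels = sorted(str(key) for key in groups)
```

Here labels holds the two coordinate labels 'val' and 'var', and groups['val'] is the part of da that da.sel(moment='val') would select.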

My current solution thus looks like this:

import xarray as xr

def function_applier(data, function, split_dimension=None, **function_kwargs):
    return xr.concat(
                function(
                    **dict(list(data.groupby(split_dimension))),
                    **function_kwargs),
                dim=split_dimension)

Now I can define functions taking specific coordinates as inputs which can be written to also work for e.g. numpy arrays. (MWE using the specific example of my original question here)

import numpy as np

def average_of_gaussians(val, var, dim=None): 
    return val.mean(dim), (var ** 2).sum(dim)/var.count(dim)

val = np.random.rand(12).reshape(2,6)
var = 0.1*np.random.rand(12).reshape(2,6)

# val and var have shape (2, 6), i.e. (position, time)
da = xr.DataArray([val, var],
                  dims=['moment','position','time'],
                  coords=[['val','var'],
                          ['a','b'],
                          np.arange(6)])

>>> da
<xarray.DataArray (moment: 2, position: 2, time: 6)>
array([[[0.66233728, 0.71419351, 0.96758741, 0.96949021, 0.94594299,
         0.05080628],
        [0.44005458, 0.64616657, 0.69865189, 0.84970553, 0.19561433,
         0.8529829 ]],

       [[0.02209967, 0.02152369, 0.09181031, 0.00223527, 0.01448938,
         0.01484197],
        [0.05651841, 0.04942305, 0.08250529, 0.04258035, 0.00184209,
         0.0957248 ]]])
Coordinates:
  * moment    (moment) <U3 'val' 'var'
  * position  (position) <U1 'a' 'b'
  * time      (time) int32 0 1 2 3 4 5

>>> function_applier(da,
...                  average_of_gaussians,
...                  split_dimension='moment',
...                  dim='time')
<xarray.DataArray (moment: 2, position: 2)>
array([[0.71839295, 0.61386263],
       [0.001636  , 0.00390397]])
Coordinates:
  * position  (position) <U1 'a' 'b'
  * moment    (moment) object 'val' 'var'

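Note that the argument names of average_of_gaussians must equal the coordinate labels along the split dimension (here val and var), because the groups are passed in as keyword arguments. Being able to apply a different operation to each variable within one function, and the lack of references to xarray within it, are the properties I am after.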

Upvotes: 0

Maximilian

Reputation: 8530

I think you can write your function in an xarray-friendly way and then call it on your data, e.g.

def average_of_squares(data, dim=None):
    sums = (data ** 2).sum(dim)
    return sums/data.count(dim)**2

q_lp_var = q_lp.sel(moment='variance').pipe(average_of_squares, dim='time')
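As a quick sanity check, here is a minimal sketch on assumed toy data (the random array and the names via_pipe, direct and reference are not from the question). It shows that pipe simply forwards the array to the function, and that the result agrees with a plain NumPy sum of squares divided by n**2.

```python
import numpy as np
import xarray as xr


def average_of_squares(data, dim=None):
    sums = (data ** 2).sum(dim)
    return sums / data.count(dim) ** 2


# Assumed toy data standing in for q_lp.sel(moment='variance').
rng = np.random.default_rng(0)
variance = xr.DataArray(rng.random((10, 3)), dims=['time', 'position'])

# pipe(func, **kwargs) calls func(variance, **kwargs).
via_pipe = variance.pipe(average_of_squares, dim='time')
direct = average_of_squares(variance, dim='time')

# Plain NumPy reference: sum of squares over time divided by n**2.
n_times = variance.sizes['time']
reference = (variance.values ** 2).sum(axis=0) / n_times ** 2
```

Because the function only uses DataArray methods, it keeps the position dimension and its labels without any axis bookkeeping.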

Having them concatenated in the same DataArray is fine; they might be a more natural fit as variables on a Dataset, though.

Does that answer your question?

Edit: re the edited question, I think holding the items in a Dataset rather than a DataArray is most coherent with the data structures. It seems like the mean and variance are two different arrays you want aligned on the same indexes, so a Dataset is ideal.
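A minimal sketch of that Dataset layout, assuming the same moment labels as in the question. DataArray.to_dataset(dim='moment') turns each label into its own data variable, so every moment can get its own reduction without sel and concat. The helper name reduce_moments and the constant test data are hypothetical; the variance rule is copied from the question's average_of_squares.

```python
import numpy as np
import xarray as xr

times = np.arange(10)
positions = np.array([1, 3, 5])
# Constant toy data so the result is easy to verify by hand.
values = np.full((len(times), len(positions)), 2.5)
variance = np.full((len(times), len(positions)), 0.5)

q_lp = xr.DataArray(np.array([values, variance]),
                    coords=[['value', 'variance'], times, positions],
                    dims=['moment', 'time', 'position'])

# One data variable per label of the 'moment' dimension.
ds = q_lp.to_dataset(dim='moment')


def reduce_moments(ds, dim):
    """Hypothetical helper: reduce each moment with its own rule."""
    n = ds['variance'].count(dim)
    return xr.Dataset({
        'value': ds['value'].mean(dim),
        # Same rule as average_of_squares in the question.
        'variance': (ds['variance'] ** 2).sum(dim) / n ** 2,
    })


ds_av = reduce_moments(ds, dim='time')
```

Both results stay aligned on position, and further quantities can be added to the same Dataset as additional variables.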

Upvotes: 1
