Reputation: 115
I have data grouped into a 3-D DataArray named 'da' with dimensions 'time', 'indicators', and 'coins', using Dask as a backend.
I need to select the data for a particular indicator, calculate a new indicator based on it, and append this newly calculated indicator to da along the indicators dimension under a new indicator name (let's call it daily_return). As a simplified 2-D analogy, I need something like calculating a new pandas DataFrame column from its other columns, but in 3-D.
So far I've tried xr.apply_ufunc(), selecting with both drop=False (the resulting DataArray then keeps a scalar indicators coordinate) and drop=True (the indicators coordinate is dropped), following the corresponding tutorial:
dr_func = lambda today, yesterday: today / yesterday - 1 # Assuming for simplicity that yesterday != 0
today_da = da.sel(indicators="price_daily_close_usd", drop=False) # or drop=True
yesterday_da = da.shift(time=1).sel(indicators="price_daily_close_usd", drop=False) # or drop=True
dr = xr.apply_ufunc(
    dr_func,
    today_da,
    yesterday_da,
    dask="allowed",
    dask_gufunc_kwargs={"allow_rechunk": True},
    vectorize=True
)
Obviously, with drop=True I cannot concat the da and dr DataArrays, since dr has no indicators coordinate.
With drop=False, in turn, I managed to concat these DataArrays along indicators; however, the resulting indicators coordinate contains two identically labelled entries, both "price_daily_close_usd", whereas the second of them should be renamed to "daily_return".
I've also tried to extract the needed data from dr through .sel(), but that failed because there is no index along the indicators dimension (as far as I understand, an index cannot be set in this case, since the coordinate is scalar):
dr.sel(indicators="price_daily_close_usd") # Would result in KeyError: "no index found for coordinate 'indicators'"
Moreover, the approach above does not work in place: it creates a new combined DataArray instance instead of modifying da, and the latter would be highly preferable.
How can I append new data to da along an existing dimension, preferably in place?
Loading all the data directly into RAM is hardly possible because of its sheer volume, which is why Dask is being used.
I'm also not tied to the DataArray data structure, and switching to a Dataset would be no problem if it has more suitable methods for solving my problem.
Upvotes: 1
Views: 792
Reputation: 6434
Xarray does not support appending in-place. Any change to the shape of your array will need to produce a new array.
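As a rough illustration of producing a new array, the sketch below computes the return, gives it its own label along indicators, and concatenates it with the original. The small NumPy-backed `da` and its values are made up for the example; the same calls (arithmetic, `shift`, `expand_dims`, `xr.concat`) should also apply to a Dask-backed array, where they are expected to stay lazy.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the asker's `da`: 4 days, 1 indicator, 2 coins.
da = xr.DataArray(
    np.array([[[1.0, 10.0]], [[2.0, 20.0]], [[3.0, 30.0]], [[6.0, 15.0]]]),
    dims=("time", "indicators", "coins"),
    coords={
        "time": pd.date_range("2021-01-01", periods=4),
        "indicators": ["price_daily_close_usd"],
        "coins": ["btc", "eth"],
    },
)

# Select the source indicator and drop the now-scalar coordinate.
close = da.sel(indicators="price_daily_close_usd", drop=True)

# Plain arithmetic is enough here; the first day has no "yesterday" and becomes NaN.
daily_return = close / close.shift(time=1) - 1

# Re-create the indicators dimension with the new label and match da's dim order.
daily_return = daily_return.expand_dims(indicators=["daily_return"]).transpose(*da.dims)

# concat returns a new array; `da` itself is left unchanged.
combined = xr.concat([da, daily_return], dim="indicators")
```

After this, `combined.sel(indicators="daily_return")` works because indicators is a proper indexed dimension again rather than a scalar coordinate.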
If you want to work with a single array and know the size of the final array, you could generate an empty array and assign values based on coordinate labels.
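A minimal sketch of that preallocation idea, assuming the final indicator labels are known up front. The names `full`, `coins`, and `n_time` and all values are invented for illustration, and the array is NumPy-backed; label-based assignment into a Dask-backed array may behave differently, so that case is worth checking separately.

```python
import numpy as np
import xarray as xr

# Hypothetical final layout: both indicator labels are known in advance.
indicators = ["price_daily_close_usd", "daily_return"]
coins = ["btc", "eth"]
n_time = 3

# Preallocate the full-size array, filled with NaN as a placeholder.
full = xr.DataArray(
    np.full((n_time, len(indicators), len(coins)), np.nan),
    dims=("time", "indicators", "coins"),
    coords={"time": np.arange(n_time), "indicators": indicators, "coins": coins},
)

# Fill the known indicator by label (made-up prices, shape: time x coins).
full.loc[{"indicators": "price_daily_close_usd"}] = np.array(
    [[1.0, 10.0], [2.0, 20.0], [4.0, 10.0]]
)

# Compute the new indicator and write it into its preallocated slot.
close = full.sel(indicators="price_daily_close_usd")
full.loc[{"indicators": "daily_return"}] = (close / close.shift(time=1) - 1).values
```

Here the shape of `full` never changes, so the assignments modify the existing array instead of creating a new one.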
"I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D."
Xarray's Dataset is a better analog to the pandas DataFrame. A Dataset is a dict-like container storing N-D arrays (DataArrays), just as a DataFrame is a dict-like container storing 1-D arrays (Series).
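To make the analogy concrete, here is a small sketch that splits the array into a Dataset with one variable per indicator and then adds the new indicator much like a new DataFrame column. The toy `da` and its values are assumptions for illustration only.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the asker's `da`: 3 days, 1 indicator, 2 coins.
da = xr.DataArray(
    np.array([[[1.0, 10.0]], [[2.0, 20.0]], [[4.0, 10.0]]]),
    dims=("time", "indicators", "coins"),
    coords={
        "time": np.arange(3),
        "indicators": ["price_daily_close_usd"],
        "coins": ["btc", "eth"],
    },
)

# One data variable per label along "indicators"; each has dims (time, coins).
ds = da.to_dataset(dim="indicators")

# Adding a variable mutates `ds`, similar to df["new_col"] = ... in pandas.
close = ds["price_daily_close_usd"]
ds["daily_return"] = close / close.shift(time=1) - 1
```

If a single 3-D array is needed again later, `ds.to_array(dim="indicators")` stacks the variables back along an indicators dimension.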
Upvotes: 3