Reputation: 30737
I have a dataset where I'm storing replicates for different classes/subtypes (not sure what to call it) and then attributes for each one. Essentially, there are 5 subtype/classes, 4 replicates for each subtype/class, and 100 attributes that are measured.
Is there a method like np.ravel or np.flatten that can merge 2 dimensions using xarray? Here, I want to merge the dims subtype and replicates so that I have a 2D array (or pd.DataFrame) with attributes vs. subtype/replicates.
It wouldn't need to have the format "coord_1 | coord_2" or anything. It would be useful if it kept the original coord names. Maybe there's something like groupby that could do this? Groupby always confuses me, so if there's something native to xarray, that would be awesome.
import xarray as xr
import numpy as np
# Set up xr.DataArray
dims = (5,4,100)
DA_data = xr.DataArray(np.random.random(dims), dims=["subtype","replicates","attributes"])
DA_data.coords["subtype"] = ["subtype_%d"%_ for _ in range(dims[0])]
DA_data.coords["replicates"] = ["rep_%d"%_ for _ in range(dims[1])]
DA_data.coords["attributes"] = ["attr_%d"%_ for _ in range(dims[2])]
# DA_data.coords
# Coordinates:
# * subtype (subtype) <U9 'subtype_0' 'subtype_1' 'subtype_2' ...
# * replicates (replicates) <U5 'rep_0' 'rep_1' 'rep_2' 'rep_3'
# * attributes (attributes) <U7 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
# DA_data.dims
# ('subtype', 'replicates', 'attributes')
# Naive way to collapse the replicate dimension into the subtype dimension
desired_columns = list()
for subtype in DA_data.coords["subtype"]:
    for replicate in DA_data.coords["replicates"]:
        # Build one "subtype|replicate" label per combination
        desired_columns.append(str(subtype.values) + "|" + str(replicate.values))
desired_columns
# ['subtype_0|rep_0',
# 'subtype_0|rep_1',
# 'subtype_0|rep_2',
# 'subtype_0|rep_3',
# 'subtype_1|rep_0',
# 'subtype_1|rep_1',
# 'subtype_1|rep_2',
# 'subtype_1|rep_3',
# 'subtype_2|rep_0',
# 'subtype_2|rep_1',
# 'subtype_2|rep_2',
# 'subtype_2|rep_3',
# 'subtype_3|rep_0',
# 'subtype_3|rep_1',
# 'subtype_3|rep_2',
# 'subtype_3|rep_3',
# 'subtype_4|rep_0',
# 'subtype_4|rep_1',
# 'subtype_4|rep_2',
# 'subtype_4|rep_3']
Upvotes: 5
Views: 5329
Reputation: 9603
Yes, this is exactly what .stack is for:
In [33]: stacked = DA_data.stack(desired=['subtype', 'replicates'])
In [34]: stacked
Out[34]:
<xarray.DataArray (attributes: 100, desired: 20)>
array([[ 0.54020268, 0.14914837, 0.83398895, ..., 0.25986503,
0.62520466, 0.08617668],
[ 0.47021735, 0.10627027, 0.66666478, ..., 0.84392176,
0.64461418, 0.4444864 ],
[ 0.4065543 , 0.59817851, 0.65033094, ..., 0.01747058,
0.94414244, 0.31467342],
...,
[ 0.23724934, 0.61742922, 0.97563316, ..., 0.62966631,
0.89513904, 0.20139552],
[ 0.21157447, 0.43868899, 0.77488211, ..., 0.98285015,
0.24367352, 0.8061804 ],
[ 0.21518079, 0.234854 , 0.18294781, ..., 0.64679141,
0.49678393, 0.32215219]])
Coordinates:
* attributes (attributes) |S7 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
* desired (desired) object ('subtype_0', 'rep_0') ...
The resulting stacked coordinate is a pandas.MultiIndex, whose values are given by tuples:
In [35]: stacked['desired'].values
Out[35]:
array([('subtype_0', 'rep_0'), ('subtype_0', 'rep_1'),
('subtype_0', 'rep_2'), ('subtype_0', 'rep_3'),
('subtype_1', 'rep_0'), ('subtype_1', 'rep_1'),
('subtype_1', 'rep_2'), ('subtype_1', 'rep_3'),
('subtype_2', 'rep_0'), ('subtype_2', 'rep_1'),
('subtype_2', 'rep_2'), ('subtype_2', 'rep_3'),
('subtype_3', 'rep_0'), ('subtype_3', 'rep_1'),
('subtype_3', 'rep_2'), ('subtype_3', 'rep_3'),
('subtype_4', 'rep_0'), ('subtype_4', 'rep_1'),
('subtype_4', 'rep_2'), ('subtype_4', 'rep_3')], dtype=object)
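Since the question also asks for a pd.DataFrame, here is a minimal sketch of how the stacked result might be carried over to pandas. It rebuilds the example array from the question, and the names df, flat_columns, df_flat and restored are only illustrative. It assumes a reasonably recent xarray and pandas; the "|" labels are optional and just mirror the naive loop in the question.

```python
import numpy as np
import xarray as xr

# Rebuild the example array from the question:
# 5 subtypes x 4 replicates x 100 attributes.
shape = (5, 4, 100)
DA_data = xr.DataArray(
    np.random.random(shape),
    dims=["subtype", "replicates", "attributes"],
    coords={
        "subtype": ["subtype_%d" % i for i in range(shape[0])],
        "replicates": ["rep_%d" % i for i in range(shape[1])],
        "attributes": ["attr_%d" % i for i in range(shape[2])],
    },
)

# Merge the two dimensions into one new dimension called "desired".
stacked = DA_data.stack(desired=["subtype", "replicates"])

# A 2D DataArray converts to a DataFrame: rows are attributes and the
# columns are a MultiIndex that keeps the original coordinate names.
df = stacked.to_pandas()

# Optional: flatten the (subtype, replicate) tuples into single strings,
# like the "subtype_0|rep_0" labels built by hand in the question.
flat_columns = ["|".join(pair) for pair in df.columns]
df_flat = df.copy()
df_flat.columns = flat_columns

# unstack() reverses the operation and brings both dimensions back.
restored = stacked.unstack("desired")
```

Keeping the MultiIndex columns (df) is usually more convenient than the flattened strings, because label-based selection such as df["subtype_0"] still works on the first level.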
Upvotes: 6