Reputation: 581
I have a dictionary of dataframes where each dictionary key corresponds to sample name and the dataframe itself has a "time" column and some measurement columns (temperature, concentration,etc). The time column is not consistent among samples (both start and finish times are different for different samples, although I think all time points between start and finish are measured/have same dT).
I want to merge all of the data into a single xarray where one axis is the time, the other axis is the measurement type, and the third axis is the sample name. Since not all times are measured for all samples the missing data should be replaced with nan.
I have little experience with xarray but doing a simple merger (construct xarray from dict of xarrays) I couldn't figure out how to make "time" one of the axis (instead each just concatenated all of the samples with time being one of the data columns).
Thank you for your help!
Edit:
Here is code I have with dummy data
import pandas as pd
import xarray as xr
#make fake data
dfs = {'sample1':pd.DataFrame([[1,0,0],[2,0,0],[3,0,0]],columns = ["Time","ColA","ColB"]),
'sample2':pd.DataFrame([[2,1,1],[3,1,1],[4,1,1]],columns = ["Time","ColA","ColB"])}
#code I use for real data
xrs = {k: xr.DataArray(v) for k, v in dfs.items()}
merged = xr.Dataset(variables).to_array(dim="samples")
print(merged)
Output is:
<xarray.DataArray (samples: 2, dim_0: 3, dim_1: 3)>
array([[[1, 0, 0],
[2, 0, 0],
[3, 0, 0]],
[[2, 1, 1],
[3, 1, 1],
[4, 1, 1]]], dtype=int64)
Coordinates:
* dim_0 (dim_0) int64 0 1 2
* dim_1 (dim_1) object 'Time' 'ColA' 'ColB'
* samples (samples) <U7 'sample1' 'sample2'
Desired output:
<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[0, 0],
[0, 0],
[0, 0],
[nan, nan]],
[[nan, nan]
[1, 1],
[1, 1],
[1, 1]]], dtype=int64)
Coordinates:
* Time (Time) int64 1 2 3 4
* dim_1 (dim_1) object 'ColA' 'ColB'
* samples (samples) <U7 'sample1' 'sample2'
Upvotes: 3
Views: 1745
Reputation: 9623
Xarray supports two basic formats for converting from pandas:
xarray.DataArray
, where the index
and columns
already correctly label the axes.index
correspond to the desired axes on the result and columns
correspond to desired variables in an xarray.Dataset
.Your format is closest to the former, so let's try that. It turns out all you were missing is a call to .set_index('Time')
to indicate that that column should be used for labeling the axis:
In [23]: da = xr.Dataset({k: v.set_index('Time') for k, v in dfs.items()}).to_array(dim='samples')
In [24]: da
Out[24]:
<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ nan, nan]],
[[ nan, nan],
[ 1., 1.],
[ 1., 1.],
[ 1., 1.]]])
Coordinates:
* Time (Time) int64 1 2 3 4
* dim_1 (dim_1) object 'ColA' 'ColB'
* samples (samples) <U7 'sample1' 'sample2'
The second approach is also worth thinking about, because making your data tidy will also make it easier to work with as a DataFrame (e.g., to easily plot your data in Seaborn).
Once you have the xarray.DataArray
, you can convert it back into pandas using to_series()
:
In [22]: da.to_series()
Out[22]:
samples Time dim_1
sample1 1 ColA 0.0
ColB 0.0
2 ColA 0.0
ColB 0.0
3 ColA 0.0
ColB 0.0
4 ColA NaN
ColB NaN
sample2 1 ColA NaN
ColB NaN
2 ColA 1.0
ColB 1.0
3 ColA 1.0
ColB 1.0
4 ColA 1.0
ColB 1.0
dtype: float64
Or sticking with DataFrames, you can use pd.melt
to "melt" your wide data into long, tidy form, e.g.,
In [26]: melted = pd.melt(dfs['sample1'], id_vars=['Time'], var_name='Measurement')
In [27]: melted
Out[27]:
Time Measurement value
0 1 ColA 0
1 2 ColA 0
2 3 ColA 0
3 1 ColB 0
4 2 ColB 0
5 3 ColB 0
Then converting from xarray is just a matter of setting a MultiIndex and calling .to_xarray()
:
In [28]: melted.set_index(['Time', 'Measurement']).to_xarray()
Out[28]:
<xarray.Dataset>
Dimensions: (Measurement: 2, Time: 3)
Coordinates:
* Time (Time) int64 1 2 3
* Measurement (Measurement) object 'ColA' 'ColB'
Data variables:
value (Time, Measurement) int64 0 0 0 0 0 0
Upvotes: 1