Ilya
Ilya

Reputation: 581

Using xarray to merge pandas dataframes based on a column

I have a dictionary of dataframes where each dictionary key corresponds to sample name and the dataframe itself has a "time" column and some measurement columns (temperature, concentration,etc). The time column is not consistent among samples (both start and finish times are different for different samples, although I think all time points between start and finish are measured/have same dT).

I want to merge all of the data into a single xarray where one axis is the time, the other axis is the measurement type, and the third axis is the sample name. Since not all times are measured for all samples the missing data should be replaced with nan.

I have little experience with xarray but doing a simple merger (construct xarray from dict of xarrays) I couldn't figure out how to make "time" one of the axis (instead each just concatenated all of the samples with time being one of the data columns).

Thank you for your help!

Edit:

Here is code I have with dummy data

import pandas as pd
import xarray as xr

#make fake data
dfs = {'sample1':pd.DataFrame([[1,0,0],[2,0,0],[3,0,0]],columns = ["Time","ColA","ColB"]), 
    'sample2':pd.DataFrame([[2,1,1],[3,1,1],[4,1,1]],columns = ["Time","ColA","ColB"])}

#code I use for real data
xrs = {k: xr.DataArray(v) for k, v in dfs.items()}
merged = xr.Dataset(variables).to_array(dim="samples")

print(merged)

Output is:

<xarray.DataArray (samples: 2, dim_0: 3, dim_1: 3)>
array([[[1, 0, 0],
        [2, 0, 0],
        [3, 0, 0]],

       [[2, 1, 1],
        [3, 1, 1],
        [4, 1, 1]]], dtype=int64)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2
  * dim_1    (dim_1) object 'Time' 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

Desired output:

<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[0, 0],
    [0, 0],
    [0, 0],
    [nan, nan]],

   [[nan, nan]
    [1, 1],
    [1, 1],
    [1, 1]]], dtype=int64)
Coordinates:
  * Time     (Time) int64 1 2 3 4
  * dim_1    (dim_1) object 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

Upvotes: 3

Views: 1745

Answers (1)

shoyer
shoyer

Reputation: 9623

Xarray supports two basic formats for converting from pandas:

  1. DataFrames representing an 2D matrix that you want to convert into a xarray.DataArray, where the index and columns already correctly label the axes.
  2. DataFrames representing flattened "tidy data", where MultiIndex levels in the index correspond to the desired axes on the result and columns correspond to desired variables in an xarray.Dataset.

Your format is closest to the former, so let's try that. It turns out all you were missing is a call to .set_index('Time') to indicate that that column should be used for labeling the axis:

In [23]: da = xr.Dataset({k: v.set_index('Time') for k, v in dfs.items()}).to_array(dim='samples')

In [24]: da
Out[24]:
<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[  0.,   0.],
        [  0.,   0.],
        [  0.,   0.],
        [ nan,  nan]],

       [[ nan,  nan],
        [  1.,   1.],
        [  1.,   1.],
        [  1.,   1.]]])
Coordinates:
  * Time     (Time) int64 1 2 3 4
  * dim_1    (dim_1) object 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

The second approach is also worth thinking about, because making your data tidy will also make it easier to work with as a DataFrame (e.g., to easily plot your data in Seaborn).

Once you have the xarray.DataArray, you can convert it back into pandas using to_series():

In [22]: da.to_series()
Out[22]:
samples  Time  dim_1
sample1  1     ColA     0.0
               ColB     0.0
         2     ColA     0.0
               ColB     0.0
         3     ColA     0.0
               ColB     0.0
         4     ColA     NaN
               ColB     NaN
sample2  1     ColA     NaN
               ColB     NaN
         2     ColA     1.0
               ColB     1.0
         3     ColA     1.0
               ColB     1.0
         4     ColA     1.0
               ColB     1.0
dtype: float64

Or sticking with DataFrames, you can use pd.melt to "melt" your wide data into long, tidy form, e.g.,

In [26]: melted = pd.melt(dfs['sample1'], id_vars=['Time'], var_name='Measurement')

In [27]: melted
Out[27]:
   Time Measurement  value
0     1        ColA      0
1     2        ColA      0
2     3        ColA      0
3     1        ColB      0
4     2        ColB      0
5     3        ColB      0

Then converting from xarray is just a matter of setting a MultiIndex and calling .to_xarray():

In [28]: melted.set_index(['Time', 'Measurement']).to_xarray()
Out[28]:
<xarray.Dataset>
Dimensions:      (Measurement: 2, Time: 3)
Coordinates:
  * Time         (Time) int64 1 2 3
  * Measurement  (Measurement) object 'ColA' 'ColB'
Data variables:
    value        (Time, Measurement) int64 0 0 0 0 0 0

Upvotes: 1

Related Questions