Lemming

Reputation: 4205

Pandas: Merge hierarchical data

I am looking for a way to merge data with a complex hierarchy into a pandas DataFrame. The hierarchy arises from different inter-dependencies within the data: there are parameters that define how the data was produced, time-dependent observables, spatially dependent observables, and observables that depend on both time and space.

To be more explicit: Suppose that I have the following data.

import numpy as np
import pandas as pd

# Parameters
t_max = 2
t_step = 15
sites = 4

# Purely time-dependent
t = np.linspace(0, t_max, t_step)
f_t = t**2 - t

# Purely site-dependent
position = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # (x, y)
site_weight = np.arange(sites)

# Time- and site-dependent
occupation = np.arange(t_step*sites).reshape((t_step, sites))

# Time-dependent and dependent on a pair of sites (site, site2)
correlation = np.arange(t_step*sites*sites).reshape((t_step, sites, sites))

(In the end I would, of course, have many of these sets of data. One for each set of parameters.)

Now, I would like to stash all this into a pandas DataFrame. I imagine the final result to look something like this:

| ----- parameters ----- | -------------------------------- observables --------------------------------- |
|                        |                                        | ---------- time-dependent ----------- |
|                        | ----------- site-dependent --- )       ( ------------------------ |            |
|                        |                                | - site2-dependent - |                         |
| sites | t_max | t_step | site | r_x | r_y | site weight | site2 | correlation | occupation | f_t | time |

I suppose that the partially overlapping hierarchies may be impossible to achieve. It's okay if they are implicit, in the sense that I can get e.g. all site-dependent data by indexing the DataFrame in a specific way.

Also, please feel free to tell me if you think that there is a better way of arranging this data in Pandas.

Question

How can I construct a DataFrame that contains all of the above data and somehow reflects the inter-dependencies (e.g. f_t depends on time, but not on site)? And how can I do this in a way that is generic enough that it stays easy to add or remove observables with possibly new inter-dependencies (e.g. a quantity that depends on a second time axis, like a time-time correlation)?


What I got so far

In the following I will show how far I've gotten on my own. However, I don't think this is the ideal way of achieving the above; in particular, it lacks generality with respect to adding or removing observables.

Indices

Given the above data I started out by defining all the multi-indices that I am going to need.

ind_time = pd.Index(t, name='time')
ind_site = pd.Index(np.arange(sites), name='site')
ind_site_site = pd.MultiIndex.from_product([ind_site, ind_site], names=['site', 'site2'])
ind_time_site = pd.MultiIndex.from_product([ind_time, ind_site], names=['time', 'site'])
ind_time_site_site = pd.MultiIndex.from_product([ind_time, ind_site, ind_site], names=['time', 'site', 'site2'])

Individual DataFrames

Next, I created data-frames of the individual chunks of data.

df_parms = pd.DataFrame({'t_max': t_max, 't_step': t_step, 'sites': sites}, index=[0])
df_time = pd.DataFrame({'f_t': f_t}, index=ind_time)
df_position = pd.DataFrame(position, columns=['r_x', 'r_y'], index=ind_site)
df_weight = pd.DataFrame(site_weight, columns=['site weight'], index=ind_site)
df_occupation = pd.DataFrame(occupation.flatten(), index=ind_time_site, columns=['occupation'])
df_correlation = pd.DataFrame(correlation.flatten(), index=ind_time_site_site, columns=['correlation'])

The index=[0] in df_parms seems necessary because otherwise pandas complains that it got only scalar values and no index. In reality I would probably replace it with a time-stamp of when this particular simulation was run; that would at least convey some useful information.
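For illustration, the time-stamp variant could look like this (just a sketch; the rest of this post keeps index=[0], since the later code accesses df_parms.loc[0, c]):

# Label the parameter row with the run time instead of the placeholder 0.
run_id = pd.Timestamp.now()
df_parms = pd.DataFrame({'t_max': t_max, 't_step': t_step, 'sites': sites},
                        index=[run_id])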

Merge Observables

With the data-frames available, I join all the observables into one big DataFrame.

df_all_but_parms = pd.merge(
  pd.merge(
    pd.merge(
      df_time.reset_index(),
      df_occupation.reset_index(),
      how='outer'
    ),
    df_correlation.reset_index(),
    how='outer'
  ),
  pd.merge(
    df_position.reset_index(),
    df_weight.reset_index(),
    how='outer'
  ),
  how='outer'
)

This is the part of my current approach that I like the least. The merge function only works on pairs of data-frames, and it requires them to have at least one column in common. So I have to be careful about the order in which I join my data-frames, and if I were to add an orthogonal observable, I could not merge it with the others because they would not share a common column. Is there a single function that can achieve the same result with one call on a list of data-frames? I tried concat, but it doesn't merge common columns, so I ended up with lots of duplicated time and site columns.

Merge All Data

Finally, I merge my data with the parameters.

pd.concat([df_parms, df_all_but_parms], axis=1, keys=['parameters', 'observables'])

The end-result, so far, looks like this:

         parameters                 observables                                                                       
              sites  t_max  t_step         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
    0             4      2      15     0.000000  0.000000     0           0      0            0    0    0            0
    1           NaN    NaN     NaN     0.000000  0.000000     0           0      1            1    0    0            0
    2           NaN    NaN     NaN     0.000000  0.000000     0           0      2            2    0    0            0
    3           NaN    NaN     NaN     0.000000  0.000000     0           0      3            3    0    0            0
    4           NaN    NaN     NaN     0.142857 -0.122449     0           4      0           16    0    0            0
    ..          ...    ...     ...          ...       ...   ...         ...    ...          ...  ...  ...          ...
    235         NaN    NaN     NaN     1.857143  1.591837     3          55      3          223    1    1            3
    236         NaN    NaN     NaN     2.000000  2.000000     3          59      0          236    1    1            3
    237         NaN    NaN     NaN     2.000000  2.000000     3          59      1          237    1    1            3
    238         NaN    NaN     NaN     2.000000  2.000000     3          59      2          238    1    1            3
    239         NaN    NaN     NaN     2.000000  2.000000     3          59      3          239    1    1            3

As you can see, this does not work very well: only the first row is actually assigned the parameters, and all the other rows just have NaNs in their place. But since these are the parameters of all of this data, they should be contained in every row of the data-frame.

As a small side question: how smart would pandas be if I were to store the above data-frame in HDF5? Would I end up with lots of duplicated data, or would it avoid duplicate storage?
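For concreteness, this is the kind of call I have in mind (a sketch; the file name and key are placeholders, and my understanding is that the values are stored row by row, so the repeated parameters would be written out in full unless compression shrinks them):

# Assign the concatenated frame from above and write it to HDF5.
# complevel/complib are just one example of a compression setting.
df_final = pd.concat([df_parms, df_all_but_parms], axis=1,
                     keys=['parameters', 'observables'])
df_final.to_hdf('results.h5', key='run_0', complevel=9, complib='blosc')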


Update

Thanks to Jeff's answer I was able to push all my data into one data-frame with a generic merge. The basic idea is that all my observables already share a few common columns: namely, the parameters.

First I add the parameters to all my observables' data-frames.

all_observables = [df_time, df_position, df_weight, df_occupation, df_correlation]
flat = [df.reset_index() for df in all_observables]  # a list, not map(): in Python 3, map() returns a one-shot iterator that this loop would exhaust before the reduce below sees it
for df in flat:
    for c in df_parms:
        df[c] = df_parms.loc[0, c]

And then I can merge all of them together by reduction.

from functools import reduce  # required in Python 3

df_all = reduce(lambda a, b: pd.merge(a, b, how='outer'), flat)

The result of which has the desired form:

         time       f_t  sites  t_max  t_step  site  r_x  r_y  site weight  occupation  site2  correlation
0    0.000000  0.000000      4      2      15     0    0    0            0           0      0            0
1    0.000000  0.000000      4      2      15     0    0    0            0           0      1            1
2    0.000000  0.000000      4      2      15     0    0    0            0           0      2            2
3    0.000000  0.000000      4      2      15     0    0    0            0           0      3            3
4    0.142857 -0.122449      4      2      15     0    0    0            0           4      0           16
5    0.142857 -0.122449      4      2      15     0    0    0            0           4      1           17
6    0.142857 -0.122449      4      2      15     0    0    0            0           4      2           18
..        ...       ...    ...    ...     ...   ...  ...  ...          ...         ...    ...          ...
233  1.857143  1.591837      4      2      15     3    1    1            3          55      1          221
234  1.857143  1.591837      4      2      15     3    1    1            3          55      2          222
235  1.857143  1.591837      4      2      15     3    1    1            3          55      3          223
236  2.000000  2.000000      4      2      15     3    1    1            3          59      0          236
237  2.000000  2.000000      4      2      15     3    1    1            3          59      1          237
238  2.000000  2.000000      4      2      15     3    1    1            3          59      2          238
239  2.000000  2.000000      4      2      15     3    1    1            3          59      3          239

Re-indexing the data makes the hierarchy a bit more apparent:

df_all.set_index(['t_max', 't_step', 'sites', 'time', 'site', 'site2'], inplace=True)

which results in

                                             f_t  r_x  r_y  site weight  occupation  correlation
t_max t_step sites time     site site2                                                          
2     15     4     0.000000 0    0      0.000000    0    0            0           0            0
                                 1      0.000000    0    0            0           0            1
                                 2      0.000000    0    0            0           0            2
                                 3      0.000000    0    0            0           0            3
                   0.142857 0    0     -0.122449    0    0            0           4           16
                                 1     -0.122449    0    0            0           4           17
                                 2     -0.122449    0    0            0           4           18
...                                          ...  ...  ...          ...         ...          ...
                   1.857143 3    1      1.591837    1    1            3          55          221
                                 2      1.591837    1    1            3          55          222
                                 3      1.591837    1    1            3          55          223
                   2.000000 3    0      2.000000    1    1            3          59          236
                                 1      2.000000    1    1            3          59          237
                                 2      2.000000    1    1            3          59          238
                                 3      2.000000    1    1            3          59          239
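With this index in place, the implicit hierarchies mentioned at the top can be recovered by plain indexing. For example (a sketch):

# All observables for site 2, across all times and parameter values:
df_all.xs(2, level='site')

# Only the purely site-dependent observables, one row per site:
df_all.groupby(level='site')[['r_x', 'r_y', 'site weight']].first()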

Upvotes: 4

Views: 2367

Answers (1)

Jeff

Reputation: 128978

I think you should do something like this, putting df_parms as your index. This way you can easily concat more frames with different parms.

In [67]: pd.set_option('max_rows',10)

In [68]: dfx = df_all_but_parms.copy()

You need to assign the parameter columns to the frame (you could also construct a multi-index directly, but this starts from your data as it is).

In [69]: for c in df_parms.columns:
             dfx[c] = df_parms.loc[0,c]

In [70]: dfx
Out[70]: 
         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight  sites  t_max  t_step
0    0.000000  0.000000     0           0      0            0    0    0            0      4      2      15
1    0.000000  0.000000     0           0      1            1    0    0            0      4      2      15
2    0.000000  0.000000     0           0      2            2    0    0            0      4      2      15
3    0.000000  0.000000     0           0      3            3    0    0            0      4      2      15
4    0.142857 -0.122449     0           4      0           16    0    0            0      4      2      15
..        ...       ...   ...         ...    ...          ...  ...  ...          ...    ...    ...     ...
235  1.857143  1.591837     3          55      3          223    1    1            3      4      2      15
236  2.000000  2.000000     3          59      0          236    1    1            3      4      2      15
237  2.000000  2.000000     3          59      1          237    1    1            3      4      2      15
238  2.000000  2.000000     3          59      2          238    1    1            3      4      2      15
239  2.000000  2.000000     3          59      3          239    1    1            3      4      2      15

[240 rows x 12 columns]

Set the index (this returns a new object)

In [71]: dfx.set_index(['sites','t_max','t_step'])
Out[71]: 
                        time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
sites t_max t_step                                                                                 
4     2     15      0.000000  0.000000     0           0      0            0    0    0            0
            15      0.000000  0.000000     0           0      1            1    0    0            0
            15      0.000000  0.000000     0           0      2            2    0    0            0
            15      0.000000  0.000000     0           0      3            3    0    0            0
            15      0.142857 -0.122449     0           4      0           16    0    0            0
...                      ...       ...   ...         ...    ...          ...  ...  ...          ...
            15      1.857143  1.591837     3          55      3          223    1    1            3
            15      2.000000  2.000000     3          59      0          236    1    1            3
            15      2.000000  2.000000     3          59      1          237    1    1            3
            15      2.000000  2.000000     3          59      2          238    1    1            3
            15      2.000000  2.000000     3          59      3          239    1    1            3

[240 rows x 9 columns]
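The payoff is that frames from different runs stack directly (df_other here is a hypothetical frame from a second run, prepared the same way as dfx):

In [72]: pd.concat([dfx.set_index(['sites','t_max','t_step']),
   ....:            df_other.set_index(['sites','t_max','t_step'])])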

Upvotes: 2
