I am looking for a way to merge data that has a complex hierarchy into a pandas DataFrame. This hierarchy comes about through different inter-dependencies within the data: there are parameters which define how the data was produced, time-dependent observables, spatially dependent observables, and observables that depend on both time and space.
To be more explicit: Suppose that I have the following data.
import numpy as np

# Parameters
t_max = 2
t_step = 15
sites = 4
# Purely time-dependent
t = np.linspace(0, t_max, t_step)
f_t = t**2 - t
# Purely site-dependent
position = np.array([[0, 0], [1, 0], [0, 1], [1, 1]]) # (x, y)
site_weight = np.arange(sites)
# Time-, and site-dependent.
occupation = np.arange(t_step*sites).reshape((t_step, sites))
# Time-, site-, and site-dependent (a site x site matrix per time step)
correlation = np.arange(t_step*sites*sites).reshape((t_step, sites, sites))
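For orientation, here is a quick sanity check of the array shapes this setup produces (one value per time step, per (time, site) pair, and per (time, site, site2) triple):

```python
import numpy as np

t_max, t_step, sites = 2, 15, 4

t = np.linspace(0, t_max, t_step)  # 15 time points in [0, 2]
f_t = t**2 - t
occupation = np.arange(t_step * sites).reshape((t_step, sites))
correlation = np.arange(t_step * sites * sites).reshape((t_step, sites, sites))

print(f_t.shape, occupation.shape, correlation.shape)  # (15,) (15, 4) (15, 4, 4)
```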
(In the end I would, of course, have many of these sets of data, one for each set of parameters.)
Now, I would like to stash all of this into a pandas DataFrame. I imagine the final result to look something like this:
| ----- parameters ----- | -------------------------------- observables --------------------------------- |
|                        |                                        | ---------- time-dependent ----------- |
|                        | ----------- site-dependent ---------- ) ( ----------------------- |            |
|                        |                                 | - site2-dependent - |                         |
| sites | t_max | t_step | site | r_x | r_y | site weight | site2 | correlation | occupation | f_t | time |
I suppose that the partially overlapping hierarchies may be impossible to achieve. It's okay if they are implicit, in the sense that I can get e.g. all site-dependent data by indexing the DataFrame in a specific way.
Also, please feel free to tell me if you think that there is a better way of arranging this data in Pandas.
How can I construct a DataFrame that contains all the above data, and somehow reflects the inter-dependencies (e.g. f_t depends on time, but not on site)? And all that in a way that is sufficiently generic, such that it is easy to add or remove certain observables, with possibly new inter-dependencies (e.g. a quantity that depends on a second time axis, like a time-time correlation).
In the following I will show how far I've gotten on my own. However, I don't think that this is the ideal way of achieving the above, especially since it lacks generality with respect to adding or removing certain observables.
Given the above data I started out by defining all the multi-indices that I am going to need.
import pandas as pd

ind_time = pd.Index(t, name='time')
ind_site = pd.Index(np.arange(sites), name='site')
ind_site_site = pd.MultiIndex.from_product([ind_site, ind_site], names=['site', 'site2'])
ind_time_site = pd.MultiIndex.from_product([ind_time, ind_site], names=['time', 'site'])
ind_time_site_site = pd.MultiIndex.from_product([ind_time, ind_site, ind_site], names=['time', 'site', 'site2'])
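One detail worth verifying: `MultiIndex.from_product` varies the last level fastest, which matches NumPy's row-major `flatten()`. That is what makes assignments like `index=ind_time_site` below line up correctly with `occupation.flatten()`. A minimal check (with toy sizes, not the actual data):

```python
import numpy as np
import pandas as pd

ind_a = pd.Index([0.0, 0.5], name='time')
ind_b = pd.Index([0, 1, 2], name='site')
mi = pd.MultiIndex.from_product([ind_a, ind_b])

data = np.arange(6).reshape((2, 3))  # rows: time, columns: site
s = pd.Series(data.flatten(), index=mi)

# Entry (time=0.5, site=2) should be data[1, 2] == 5
print(s.loc[(0.5, 2)])  # 5
```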
Next, I created DataFrames of the individual chunks of data.
df_parms = pd.DataFrame({'t_max': t_max, 't_step': t_step, 'sites': sites}, index=[0])
df_time = pd.DataFrame({'f_t': f_t}, index=ind_time)
df_position = pd.DataFrame(position, columns=['r_x', 'r_y'], index=ind_site)
df_weight = pd.DataFrame(site_weight, columns=['site weight'], index=ind_site)
df_occupation = pd.DataFrame(occupation.flatten(), index=ind_time_site, columns=['occupation'])
df_correlation = pd.DataFrame(correlation.flatten(), index=ind_time_site_site, columns=['correlation'])
The index=[0] in df_parms seems necessary, because otherwise pandas complains that an index must be passed when using all scalar values. In reality I would probably replace it with a time-stamp of when this particular simulation was run. That would at least convey some useful information.
With the data-frames available, I join all the observables into one big DataFrame.
df_all_but_parms = pd.merge(
    pd.merge(
        pd.merge(
            df_time.reset_index(),
            df_occupation.reset_index(),
            how='outer'
        ),
        df_correlation.reset_index(),
        how='outer'
    ),
    pd.merge(
        df_position.reset_index(),
        df_weight.reset_index(),
        how='outer'
    ),
    how='outer'
)
This is the bit that I like the least in my current approach. The merge function only works on pairs of data-frames, and it requires them to have at least one common column. So I have to be careful about the order in which I join my data-frames, and if I were to add an orthogonal observable then I could not merge it with the other data, because they would not share a common column. Is there one function available that can achieve the same result with a single call on a list of data-frames? I tried concat, but it wouldn't merge common columns, so I ended up with lots of duplicate time and site columns.
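For reference, the pairwise nesting can at least be written as a fold over a list of frames, which scales to any number of observables as long as neighbouring frames share a column. A sketch with hypothetical toy frames:

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the real observables, sharing the 'time' column
df_a = pd.DataFrame({'time': [0, 1], 'f_t': [0.0, 0.5]})
df_b = pd.DataFrame({'time': [0, 0, 1, 1], 'site': [0, 1, 0, 1],
                     'occupation': [10, 11, 12, 13]})

# Fold pd.merge over the list: each step outer-merges on the shared columns
merged = reduce(lambda a, b: pd.merge(a, b, how='outer'), [df_a, df_b])
print(merged.shape)  # (4, 4)
```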
Finally, I combine my data with the parameters.
pd.concat([df_parms, df_all_but_parms], axis=1, keys=['parameters', 'observables'])
The end-result, so far, looks like this:
parameters observables
sites t_max t_step time f_t site occupation site2 correlation r_x r_y site weight
0 4 2 15 0.000000 0.000000 0 0 0 0 0 0 0
1 NaN NaN NaN 0.000000 0.000000 0 0 1 1 0 0 0
2 NaN NaN NaN 0.000000 0.000000 0 0 2 2 0 0 0
3 NaN NaN NaN 0.000000 0.000000 0 0 3 3 0 0 0
4 NaN NaN NaN 0.142857 -0.122449 0 4 0 16 0 0 0
.. ... ... ... ... ... ... ... ... ... ... ... ...
235 NaN NaN NaN 1.857143 1.591837 3 55 3 223 1 1 3
236 NaN NaN NaN 2.000000 2.000000 3 59 0 236 1 1 3
237 NaN NaN NaN 2.000000 2.000000 3 59 1 237 1 1 3
238 NaN NaN NaN 2.000000 2.000000 3 59 2 238 1 1 3
239 NaN NaN NaN 2.000000 2.000000 3 59 3 239 1 1 3
As you can see, this does not work very well, since only the first row is actually assigned the parameters. All the other rows just have NaNs in place of the parameters. But since these are the parameters of all of that data, they should also be contained in all the other rows of this data-frame.
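The NaNs come from index alignment: `pd.concat(..., axis=1)` aligns rows on the index, and df_parms only carries the single label 0, so every other row has nothing to align with. A minimal reproduction with hypothetical frames:

```python
import pandas as pd

params = pd.DataFrame({'t_max': [2]})         # index: [0]
obs = pd.DataFrame({'f_t': [0.0, 0.1, 0.2]})  # index: [0, 1, 2]

out = pd.concat([params, obs], axis=1)
print(out['t_max'].isna().tolist())  # [False, True, True]
```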
As a small side question: how smart would pandas be if I were to store the above data-frame in HDF5? Would I end up with lots of duplicated data, or would it avoid duplicate storage?
Thanks to Jeff's answer I was able to push all my data into one data-frame with a generic merge. The basic idea is that all my observables already share a few common columns, namely the parameters.
First I add the parameters to all my observables' data-frames.
all_observables = [df_time, df_position, df_weight, df_occupation, df_correlation]
flat = [df.reset_index() for df in all_observables]  # a list, not a map iterator,
                                                     # so it can be iterated twice
for df in flat:
    for c in df_parms:
        df[c] = df_parms.loc[0, c]
And then I can merge all of them together by reduction.
from functools import reduce  # reduce is no longer a builtin in Python 3

df_all = reduce(lambda a, b: pd.merge(a, b, how='outer'), flat)
The result of which has the desired form:
time f_t sites t_max t_step site r_x r_y site weight occupation site2 correlation
0 0.000000 0.000000 4 2 15 0 0 0 0 0 0 0
1 0.000000 0.000000 4 2 15 0 0 0 0 0 1 1
2 0.000000 0.000000 4 2 15 0 0 0 0 0 2 2
3 0.000000 0.000000 4 2 15 0 0 0 0 0 3 3
4 0.142857 -0.122449 4 2 15 0 0 0 0 4 0 16
5 0.142857 -0.122449 4 2 15 0 0 0 0 4 1 17
6 0.142857 -0.122449 4 2 15 0 0 0 0 4 2 18
.. ... ... ... ... ... ... ... ... ... ... ... ...
233 1.857143 1.591837 4 2 15 3 1 1 3 55 1 221
234 1.857143 1.591837 4 2 15 3 1 1 3 55 2 222
235 1.857143 1.591837 4 2 15 3 1 1 3 55 3 223
236 2.000000 2.000000 4 2 15 3 1 1 3 59 0 236
237 2.000000 2.000000 4 2 15 3 1 1 3 59 1 237
238 2.000000 2.000000 4 2 15 3 1 1 3 59 2 238
239 2.000000 2.000000 4 2 15 3 1 1 3 59 3 239
By re-indexing the data, the hierarchy becomes a bit more apparent:
df_all.set_index(['t_max', 't_step', 'sites', 'time', 'site', 'site2'], inplace=True)
which results in
f_t r_x r_y site weight occupation correlation
t_max t_step sites time site site2
2 15 4 0.000000 0 0 0.000000 0 0 0 0 0
1 0.000000 0 0 0 0 1
2 0.000000 0 0 0 0 2
3 0.000000 0 0 0 0 3
0.142857 0 0 -0.122449 0 0 0 4 16
1 -0.122449 0 0 0 4 17
2 -0.122449 0 0 0 4 18
... ... ... ... ... ... ...
1.857143 3 1 1.591837 1 1 3 55 221
2 1.591837 1 1 3 55 222
3 1.591837 1 1 3 55 223
2.000000 3 0 2.000000 1 1 3 59 236
1 2.000000 1 1 3 59 237
2 2.000000 1 1 3 59 238
3 2.000000 1 1 3 59 239
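With the hierarchical index in place, the implicit sub-hierarchies can be recovered by selecting on index levels, e.g. with DataFrame.xs. A sketch on a toy frame with the same index structure:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0.0, 0.5], [0, 1], [0, 1]],
                                 names=['time', 'site', 'site2'])
df = pd.DataFrame({'correlation': np.arange(8)}, index=idx)

# All rows belonging to site 0, across every time and site2:
site0 = df.xs(0, level='site')
print(site0['correlation'].tolist())  # [0, 1, 4, 5]
```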
I think you should do something like this, putting df_parms as your index. This way you can easily concat more frames with different parms.
In [67]: pd.set_option('max_rows',10)
In [68]: dfx = df_all_but_parms.copy()
You need to assign the columns to the frame (you can also directly construct a multi-index, but this is starting from your data).
In [69]: for c in df_parms.columns:
   ....:     dfx[c] = df_parms.loc[0,c]
In [70]: dfx
Out[70]:
time f_t site occupation site2 correlation r_x r_y site weight sites t_max t_step
0 0.000000 0.000000 0 0 0 0 0 0 0 4 2 15
1 0.000000 0.000000 0 0 1 1 0 0 0 4 2 15
2 0.000000 0.000000 0 0 2 2 0 0 0 4 2 15
3 0.000000 0.000000 0 0 3 3 0 0 0 4 2 15
4 0.142857 -0.122449 0 4 0 16 0 0 0 4 2 15
.. ... ... ... ... ... ... ... ... ... ... ... ...
235 1.857143 1.591837 3 55 3 223 1 1 3 4 2 15
236 2.000000 2.000000 3 59 0 236 1 1 3 4 2 15
237 2.000000 2.000000 3 59 1 237 1 1 3 4 2 15
238 2.000000 2.000000 3 59 2 238 1 1 3 4 2 15
239 2.000000 2.000000 3 59 3 239 1 1 3 4 2 15
[240 rows x 12 columns]
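As an aside, the column-assignment loop can also be written in a single call with DataFrame.assign, which broadcasts scalar values across all rows (same result, just more compact; toy frames used for illustration):

```python
import pandas as pd

df_parms = pd.DataFrame({'t_max': 2, 't_step': 15, 'sites': 4}, index=[0])
dfx = pd.DataFrame({'time': [0.0, 0.5], 'f_t': [0.0, -0.25]})

# .iloc[0] yields a Series; ** unpacks it into column=scalar keyword arguments
dfx = dfx.assign(**df_parms.iloc[0])
print(dfx['t_max'].tolist())  # [2, 2]
```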
Set the index (this returns a new object)
In [71]: dfx.set_index(['sites','t_max','t_step'])
Out[71]:
time f_t site occupation site2 correlation r_x r_y site weight
sites t_max t_step
4 2 15 0.000000 0.000000 0 0 0 0 0 0 0
15 0.000000 0.000000 0 0 1 1 0 0 0
15 0.000000 0.000000 0 0 2 2 0 0 0
15 0.000000 0.000000 0 0 3 3 0 0 0
15 0.142857 -0.122449 0 4 0 16 0 0 0
... ... ... ... ... ... ... ... ... ...
15 1.857143 1.591837 3 55 3 223 1 1 3
15 2.000000 2.000000 3 59 0 236 1 1 3
15 2.000000 2.000000 3 59 1 237 1 1 3
15 2.000000 2.000000 3 59 2 238 1 1 3
15 2.000000 2.000000 3 59 3 239 1 1 3
[240 rows x 9 columns]