Reputation: 21
I'm having trouble to figure out the best or easiest way to store multiple data about the same entity in a pandas column, to exemplify, i have some pandas dataframes that looks like this:
a b c
item0 2.0 NaN 1.1
item1 1.3 2.2 2.0
item2 1.4 NaN NaN
a b c
item0 foo bar bar
item1 bar foo bar
item2 foo foo bar
I would like to aggregate these dataframes values into one, I'm not being successful in trying to assign them to a multi index dataframe, this is what I'm hoping to get:
a b c
item0 {prop1:2.0, prop2: foo} {prop1:NaN, prop2: bar} {prop1:1.1, prop2: bar}
item1 {prop1:1.3, prop2: bar} {prop1:2.2, prop2: foo} {prop1:2.0, prop2: bar}
item2 {prop1:1.4, prop2: foo} {prop1:NaN, prop2: foo} {prop1:NaN, prop2: bar}
Or
a b c
prop1 prop2 prop1 prop2 prop1 prop2
item0 2.0 foo NaN bar 1.1 bar
item1 1.3 bar 2.2 foo 2.0 bar
item2 1.4 foo NaN foo NaN bar
Is there a simple way to aggregate multiple dataframes in these forms?
Upvotes: 1
Views: 1671
Reputation: 59549
The second option is preferable. You lose much of the pandas
efficiency when you store objects in a DataFrame like dictionaries. Also basic manipulations become more difficult.
Since the alignment is on the Index, this is just concat
with the keys
argument. You can then swap the levels if you want prop
on the bottom.
res = (pd.concat([df1, df2], axis=1, keys=['prop1', 'prop2'])
.swaplevel(0,1, axis=1)
.sort_index(axis=1))
print(res)
a b c
prop1 prop2 prop1 prop2 prop1 prop2
item0 2.0 foo NaN bar 1.1 bar
item1 1.3 bar 2.2 foo 2.0 bar
item2 1.4 foo NaN foo NaN bar
Upvotes: 2
Reputation: 109546
df1 = pd.DataFrame(
{'a': [2., 1.3, 1.4], 'b': [np.nan, 2.2, np.nan], 'c': [1.1, 2., np.nan]},
index=['item0', 'item1', 'item2']
)
df2 = pd.DataFrame(
{'a': ['foo', 'bar', 'foo'], 'b': ['bar', 'foo', 'foo'], 'c': ['bar'] * 3},
index=['item0', 'item1', 'item2']
)
df1.columns = pd.MultiIndex.from_product([df1, ['prop1']])
df2.columns = pd.MultiIndex.from_product([df2, ['prop2']])
>>> pd.concat([df1, df2], axis=1).sort_index(axis=1, level=0)
a b c
prop1 prop2 prop1 prop2 prop1 prop2
item0 2.0 foo NaN bar 1.1 bar
item1 1.3 bar 2.2 foo 2.0 bar
item2 1.4 foo NaN foo NaN bar
Alternatively, concatenate the dataframes (assuming that they have identical columns), assign a new multi-index, then resort back to the original order:
df = pd.concat([df1, df2], axis=1)
df.columns = pd.MultiIndex.from_tuples(product(['prop1', 'prop2'], df1))
df = df.swaplevel(0, 1, axis=1)[product(df1, ['prop1', 'prop2'])]
Or per the keys
parameter as used by @ALollz:
keys = ['prop1', 'prop2']
df = pd.concat([df1, df2], axis=1, keys=keys)
df = df.swaplevel(0, 1, axis=1)[product(df1, keys)]
Upvotes: 1
Reputation: 21
Is there a simple way to aggregate multiple dataframes in these forms?
As I understand you question, you are trying to figure out which index to use for the aggregation or merging of multiple DataFrames.
Pandas
offers three ways, concat()
, merge()
, join()
You need to specify the index, Left, Inner, Right.
Src: Merge DF
If this seems like what you are looking for, I can extend on this answer.
Upvotes: 0