Reputation: 21

Best way to store different values in pandas dataframe columns?

I'm having trouble to figure out the best or easiest way to store multiple data about the same entity in a pandas column, to exemplify, i have some pandas dataframes that looks like this:

                a    b     c
     item0    2.0  NaN   1.1
     item1    1.3  2.2   2.0
     item2    1.4  NaN   NaN

                a    b     c
     item0    foo  bar   bar
     item1    bar  foo   bar
     item2    foo  foo   bar

I would like to aggregate these dataframes values into one, I'm not being successful in trying to assign them to a multi index dataframe, this is what I'm hoping to get:

                                   a                        b                         c
     item0    {prop1:2.0, prop2: foo}  {prop1:NaN, prop2: bar}   {prop1:1.1, prop2: bar}
     item1    {prop1:1.3, prop2: bar}  {prop1:2.2, prop2: foo}   {prop1:2.0, prop2: bar}
     item2    {prop1:1.4, prop2: foo}  {prop1:NaN, prop2: foo}   {prop1:NaN, prop2: bar}

                         a              b              c
              prop1  prop2   prop1  prop2   prop1  prop2
     item0      2.0    foo     NaN    bar     1.1    bar
     item1      1.3    bar     2.2    foo     2.0    bar
     item2      1.4    foo     NaN    foo     NaN    bar

Is there a simple way to aggregate multiple dataframes in these forms?

Upvotes: 1

Answers (3)

ALollz

Reputation: 59549

The second option is preferable. You lose much of the pandas efficiency when you store objects in a DataFrame like dictionaries. Also basic manipulations become more difficult.

Since the alignment is on the Index, this is just concat with the keys argument. You can then swap the levels if you want prop on the bottom.

res = (pd.concat([df1, df2], axis=1, keys=['prop1', 'prop2'])
         .swaplevel(0,1, axis=1)
         .sort_index(axis=1))

print(res)
          a           b           c      
      prop1 prop2 prop1 prop2 prop1 prop2
item0   2.0   foo   NaN   bar   1.1   bar
item1   1.3   bar   2.2   foo   2.0   bar
item2   1.4   foo   NaN   foo   NaN   bar

Upvotes: 2

Alexander

Reputation: 109546

df1 = pd.DataFrame(
    {'a': [2., 1.3, 1.4], 'b': [np.nan, 2.2, np.nan], 'c': [1.1, 2., np.nan]},
     index=['item0', 'item1', 'item2']
)
df2 = pd.DataFrame(
    {'a': ['foo', 'bar', 'foo'], 'b': ['bar', 'foo', 'foo'], 'c': ['bar'] * 3},
     index=['item0', 'item1', 'item2']
)


df1.columns = pd.MultiIndex.from_product([df1, ['prop1']])
df2.columns = pd.MultiIndex.from_product([df2, ['prop2']])

>>> pd.concat([df1, df2], axis=1).sort_index(axis=1, level=0)
          a           b           c      
      prop1 prop2 prop1 prop2 prop1 prop2
item0   2.0   foo   NaN   bar   1.1   bar
item1   1.3   bar   2.2   foo   2.0   bar
item2   1.4   foo   NaN   foo   NaN   bar

Alternatively, concatenate the dataframes (assuming that they have identical columns), assign a new multi-index, then resort back to the original order:

df = pd.concat([df1, df2], axis=1)
df.columns = pd.MultiIndex.from_tuples(product(['prop1', 'prop2'], df1))
df = df.swaplevel(0, 1, axis=1)[product(df1, ['prop1', 'prop2'])]

Or per the keys parameter as used by @ALollz:

keys = ['prop1', 'prop2']
df = pd.concat([df1, df2], axis=1, keys=keys)
df = df.swaplevel(0, 1, axis=1)[product(df1, keys)]

Upvotes: 1

Pat Moss

Reputation: 21

Is there a simple way to aggregate multiple dataframes in these forms?

As I understand you question, you are trying to figure out which index to use for the aggregation or merging of multiple DataFrames.

Pandas offers three ways, concat(), merge(), join()

You need to specify the index, Left, Inner, Right.

Src: Merge DF

If this seems like what you are looking for, I can extend on this answer.

Upvotes: 0

Best way to store different values in pandas dataframe columns?

Answers (3)

Related Questions