Rutger Kassies
Rutger Kassies

Reputation: 64443

Add a column with a groupby on a hierarchical dataframe

I have a dataframe structured like this:

First     A                             B                         
Second  bar       baz       foo       bar       baz       foo     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    8    7    7    4    7    5    3    2    2    6    2
1         8    6    5    7    8    7    1    8    6    0    3    9
2         9    2    2    9    7    3    1    8    4    1    0    8
3         3    6    0    6    3    2    2    6    2    4    6    9
4         7    6    4    3    1    5    0    4    8    4    8    1

So there are three column levels. I want to add a new column on the second level where for each of the third levels a computation is performed, for example 'new' = 'foo' + 'bar'. So the resulting dataframe would look like:

First     A                                       B                                   
Second  bar       baz       foo       new       bar       baz       foo       new     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    8    7    7    4    7    7   15    5    3    2    2    6    2   11    5
1         8    6    5    7    8    7   16   13    1    8    6    0    3    9    4   17
2         9    2    2    9    7    3   16    5    1    8    4    1    0    8    1   16
3         3    6    0    6    3    2    6    8    2    6    2    4    6    9    8   15
4         7    6    4    3    1    5    8   11    0    4    8    4    8    1    8    5

I have found a workaround which is listed at the end of this post, but its not at all 'panda-style' and prone to errors. The apply or transform function on a group seems like the right way to go but after hours of trying I still do not succeed. I figured the correct way should be something like:

def func(data):

    fi = data.columns[0][0]
    th = data.columns[0][2]

    data[(fi,'new',th)] = data[(fi,'foo',th)] + data[(fi,'bar',th)]

    print data
    return data

print grouped.apply(func)

The new column is properly added within the function, but is not returned. Using the same function with transform would work if the 'new' column already exists in the df, but how do you add a new column at a specific level 'on the fly' or before grouping?

The code to generate the sample df is:

import pandas, itertools

first = ['A','B']
second = ['foo','bar','baz']
third = ['dog', 'cat']

tuples = []
for tup in itertools.product(first, second, third):
    tuples.append(tup)

columns = pandas.MultiIndex.from_tuples(tuples, names=['First','Second','Third'])

data = np.random.randint(0,10,(5, 12))
df = pandas.DataFrame(data, columns=columns)

And my workaround:

dfnew = None
grouped = df.groupby(by=None, level=[0,2], axis=1)

for name, group in grouped:
    newparam = group.xs('foo', axis=1, level=1) + group.xs('bar', axis=1, level=1)

    dftmp = group.join(pandas.DataFrame(np.array(newparam), columns=pandas.MultiIndex.from_tuples([(group.columns[0][0], 'new', group.columns[0][2])], names=['First','Second', 'Third'])))

    if dfnew is None:
        dfnew = dftmp
    else:
        dfnew = pandas.concat([dfnew, dftmp], axis=1)

print dfnew.sort_index(axis=1)

Which works, but creating a new dataframe for each group and 'manually' assigning the levels is a really bad practice.

So what is the proper way to do this? I found several posts dealing with similar questions, but all of these had only 1 level of columns, and that's exactly what I'm struggling with.

Upvotes: 9

Views: 1368

Answers (1)

Wes McKinney
Wes McKinney

Reputation: 105591

There definitely is a weakness in the API here but I'm not sure off the top of my head to make it easier to do what you're doing. Here's one simple way around this, at least for your example:

In [20]: df
Out[20]: 
First     A                             B                         
Second  foo       bar       baz       foo       bar       baz     
Third   dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat
0         7    2    9    3    3    0    5    9    8    2    0    6
1         1    4    1    7    2    3    2    3    1    0    4    0
2         6    5    0    6    6    1    5    1    7    4    3    6
3         4    8    1    9    0    3    9    2    3    1    5    9
4         6    1    1    5    1    2    2    6    3    7    2    1

In [21]: rdf = df.stack(['First', 'Third'])

In [22]: rdf['new'] = rdf.foo + rdf.bar

In [23]: rdf
Out[23]: 
Second         bar  baz  foo  new
  First Third                    
0 A     cat      3    0    2    5
        dog      9    3    7   16
  B     cat      2    6    9   11
        dog      8    0    5   13
1 A     cat      7    3    4   11
        dog      1    2    1    2
  B     cat      0    0    3    3
        dog      1    4    2    3
2 A     cat      6    1    5   11
        dog      0    6    6    6
  B     cat      4    6    1    5
        dog      7    3    5   12
3 A     cat      9    3    8   17
        dog      1    0    4    5
  B     cat      1    9    2    3
        dog      3    5    9   12
4 A     cat      5    2    1    6
        dog      1    1    6    7
  B     cat      7    1    6   13
        dog      3    2    2    5

In [24]: rdf.unstack(['First', 'Third'])
Out[24]: 
Second  bar                 baz                 foo                 new               
First     A         B         A         B         A         B         A         B     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    9    2    8    0    3    6    0    2    7    9    5    5   16   11   13
1         7    1    0    1    3    2    0    4    4    1    3    2   11    2    3    3
2         6    0    4    7    1    6    6    3    5    6    1    5   11    6    5   12
3         9    1    1    3    3    0    9    5    8    4    2    9   17    5    3   12
4         5    1    7    3    2    1    1    2    1    6    6    2    6    7   13    5

And you can of course rearrange to your heart's content:

In [28]: rdf.unstack(['First', 'Third']).reorder_levels(['First', 'Second', 'Third'], axis=1).sortlevel(0, axis=1)
Out[28]: 
First     A                                       B                                   
Second  bar       baz       foo       new       bar       baz       foo       new     
Third   cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog  cat  dog
0         3    9    0    3    2    7    5   16    2    8    6    0    9    5   11   13
1         7    1    3    2    4    1   11    2    0    1    0    4    3    2    3    3
2         6    0    1    6    5    6   11    6    4    7    6    3    1    5    5   12
3         9    1    3    0    8    4   17    5    1    3    9    5    2    9    3   12
4         5    1    2    1    1    6    6    7    7    3    1    2    6    2   13    5

Upvotes: 7

Related Questions