FooBar
FooBar

Reputation: 16478

Pandas: apply operation to repetitive columns in MultiIndex

I have MultiColumns: the second level repetitively contains Job Openings and Hires. I would like to subtract one from another for each of the top-level columns - but all I try gets me into index-errors or slice errors. How can I compute it?

Sample data:

>>> df.head()
Out[25]: 
           Total nonfarm              Total private               
                   Hires Job openings         Hires Job openings   
date                                                               
2001-01-01          5777         5385          5419         4887   
2002-01-01          4849         3759          4539         3381   
2003-01-01          4971         3824          4645         3424   
2004-01-01          4827         3459          4552         3153   
2005-01-01          5207         3670          4876         3358  

expected output:

Out[25]: 
           Total nonfarm   Total private              
              difference      difference   
date                                                               
2001-01-01          1234            5678          
2002-01-01          1234            5678          
2003-01-01          1234            5678         
2004-01-01          1234            5678      
2005-01-01          1234            5678    

where the numbers obviously are not correct.

Specifically within an apply()

In order to have a generally applicable way, I was trying to set up

def apply(group):
    result = group.loc[:, pd.IndexSlice[:, 'Job openings']].div(group.loc[:, pd.IndexSlice[:, 'Hires']].values)
    result.columns = pd.MultiIndex.from_product([[group.columns.get_level_values(0)[0]], ['Ratio']])
    return result.values
foo = df.groupby(axis=1, level=0).apply(apply)

Which suffers from two problems:

I first tried to return result, instead of result.values, but that just lead to a data frame full of NaN

Specifically with using the column names

What I don't like about the highest-voted answer is that it requires on .diff() or .div() - hacks, which make the code hard to read and are hard to implement when there's more than two columns at the sub-level.

Upvotes: 3

Views: 1243

Answers (5)

piRSquared
piRSquared

Reputation: 294258

Use groupby and apply

Setup

import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01',
                          '2002-01-01',
                          '2003-01-01',
                          '2004-01-01',
                          '2005-01-01']),
    columns=pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
         ('Total private', 'Hires'), ('Total private', 'Job Openings')]
    )
)

print df

Solution

def diff(group):
    g = group.shift().sub(group).dropna()
    g.index = ['Difference']
    return g

def ratio(group):
    g = group.shift().div(group).dropna()
    g.index = ['Ratio']
    return g

def do_nothing(group):
    return group

pd.concat(
    [df.T.groupby(level=0).apply(f).T for f in [diff, ratio, do_nothing]],
    axis=1
).sort_index(axis=1)

           Total nonfarm                          Total private        \
              Difference Hires Job Openings Ratio    Difference Hires   
2001-01-01         392.0  5777         5385  1.07         532.0  5419   
2002-01-01        1090.0  4849         3759  1.29        1158.0  4539   
2003-01-01        1147.0  4971         3824  1.30        1221.0  4645   
2004-01-01        1368.0  4827         3459  1.40        1399.0  4552   
2005-01-01        1537.0  5207         3670  1.42        1518.0  4876   


           Job Openings Ratio  
2001-01-01         4887  1.11  
2002-01-01         3381  1.34  
2003-01-01         3424  1.36  
2004-01-01         3153  1.44  
2005-01-01         3358  1.45  

Upvotes: 1

piRSquared
piRSquared

Reputation: 294258

Another way to go about this is to swap the column levels and use the column accessors.

Setup

import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01',
                          '2002-01-01',
                          '2003-01-01',
                          '2004-01-01',
                          '2005-01-01']),
    columns=pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
         ('Total private', 'Hires'), ('Total private', 'Job Openings')]
    )
)

print df
           Total nonfarm              Total private             
                   Hires Job Openings         Hires Job Openings
2001-01-01          5777         5385          5419         4887
2002-01-01          4849         3759          4539         3381
2003-01-01          4971         3824          4645         3424
2004-01-01          4827         3459          4552         3153
2005-01-01          5207         3670          4876         3358

If we swap level then sort, it looks like:

print df.swaplevel(0, 1, 1).sort_index(axis=1)

                   Hires                Job Openings              
           Total nonfarm Total private Total nonfarm Total private
2001-01-01          5777          5419          5385          4887
2002-01-01          4849          4539          3759          3381
2003-01-01          4971          4645          3824          3424
2004-01-01          4827          4552          3459          3153
2005-01-01          5207          4876          3670          3358

With this we can access the Hires with .Hires or ['Hires']. Combining this with your need to subtract:

print df.swaplevel(0, 1, 1)['Hires']

            Total nonfarm  Total private
2001-01-01           5777           5419
2002-01-01           4849           4539
2003-01-01           4971           4645
2004-01-01           4827           4552
2005-01-01           5207           4876

print df.swaplevel(0, 1, 1)['Hires'] - df.swaplevel(0, 1, 1)['Job Openings']

            Total nonfarm  Total private
2001-01-01            392            532
2002-01-01           1090           1158
2003-01-01           1147           1221
2004-01-01           1368           1399
2005-01-01           1537           1518

Solution

Putting it all together with a little extra, I did:

df_ = df.swaplevel(0, 1, 1)

_df = pd.concat([
        df_,
        pd.concat([df_['Hires'] - df_['Job Openings'], df_['Hires'] / df_['Job Openings']],
                 axis=1, keys=['Difference', 'Ratio'])
    ], axis=1)

df = _df.swaplevel(0, 1, 1).sort_index(axis=1)

print df

           Total nonfarm                              Total private        \
              Difference Hires Job Openings     Ratio    Difference Hires   
2001-01-01           392  5777         5385  1.072795           532  5419   
2002-01-01          1090  4849         3759  1.289971          1158  4539   
2003-01-01          1147  4971         3824  1.299948          1221  4645   
2004-01-01          1368  4827         3459  1.395490          1399  4552   
2005-01-01          1537  5207         3670  1.418801          1518  4876   


           Job Openings     Ratio  
2001-01-01         4887  1.108860  
2002-01-01         3381  1.342502  
2003-01-01         3424  1.356600  
2004-01-01         3153  1.443704  
2005-01-01         3358  1.452055 

You could also use xs to grab cross sections.

kw = dict(axis=1, level=1)

df.xs('Hires', **kw) - df.xs('Job Openings', **kw)

            Total nonfarm  Total private
2001-01-01            392            532
2002-01-01           1090           1158
2003-01-01           1147           1221
2004-01-01           1368           1399
2005-01-01           1537           1518

Upvotes: 1

piRSquared
piRSquared

Reputation: 294258

Setup

import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01',
                          '2002-01-01',
                          '2003-01-01',
                          '2004-01-01',
                          '2005-01-01']),
    columns=pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
         ('Total private', 'Hires'), ('Total private', 'Job Openings')]
    )
)

print df

           Total nonfarm              Total private             
                   Hires Job Openings         Hires Job Openings
2001-01-01          5777         5385          5419         4887
2002-01-01          4849         3759          4539         3381
2003-01-01          4971         3824          4645         3424
2004-01-01          4827         3459          4552         3153
2005-01-01          5207         3670          4876         3358

Try:

df.T.groupby(level=0).diff(-1).dropna().T

           Total nonfarm Total private
                   Hires         Hires
2001-01-01         392.0         532.0
2002-01-01        1090.0        1158.0
2003-01-01        1147.0        1221.0
2004-01-01        1368.0        1399.0
2005-01-01        1537.0        1518.0

To apply other transforms, say a ratio, you could do:

print df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T

           Total nonfarm Total private
                   Hires         Hires
2001-01-01      1.072795      1.108860
2002-01-01      1.289971      1.342502
2003-01-01      1.299948      1.356600
2004-01-01      1.395490      1.443704
2005-01-01      1.418801      1.452055

Or:

print df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T

           Total nonfarm Total private
                   Hires         Hires
2001-01-01      1.072795      1.108860
2002-01-01      1.289971      1.342502
2003-01-01      1.299948      1.356600
2004-01-01      1.395490      1.443704
2005-01-01      1.418801      1.452055

To rename columns and combine with the original dataframe you can:

df2 = df.T.groupby(level=0).diff(-1).dropna().T
df2.columns = pd.MultiIndex.from_tuples(
    [('Total nonfarm', 'difference'),
     ('Total private', 'difference')])
pd.concat([df, df2], axis=1).sort_index(axis=1)

Looks like:

           Total nonfarm                         Total private               \
                   Hires Job Openings difference         Hires Job Openings   
2001-01-01          5777         5385      392.0          5419         4887   
2002-01-01          4849         3759     1090.0          4539         3381   
2003-01-01          4971         3824     1147.0          4645         3424   
2004-01-01          4827         3459     1368.0          4552         3153   
2005-01-01          5207         3670     1537.0          4876         3358   

           difference  
2001-01-01      532.0  
2002-01-01     1158.0  
2003-01-01     1221.0  
2004-01-01     1399.0  
2005-01-01     1518.0  

Upvotes: 3

Dorian B.
Dorian B.

Reputation: 1289

Let's keep it simple.

In [19]: df['Total nonfarm'] - df['Total private']
Out[19]: 
            Hires  Job Openings
2001-01-01    358           498
2002-01-01    310           378
2003-01-01    326           400
2004-01-01    275           306
2005-01-01    331           312

Upvotes: 1

jezrael
jezrael

Reputation: 862591

I think you can use IndexSlice:

idx = pd.IndexSlice
df[('Total private','difference')] = (df.loc[:, idx[('Total nonfarm', 'Hires')]] - 
                                      df.loc[:, idx[('Total private', 'Hires')]])
print (df)
           Total nonfarm              Total private                        
date               Hires Job openings         Hires Job openings difference
2001-01-01          5777         5385          5419         4887        358
2002-01-01          4849         3759          4539         3381        310
2003-01-01          4971         3824          4645         3424        326
2004-01-01          4827         3459          4552         3153        275
2005-01-01          5207         3670          4876         3358        331

If you want multiple columns, you can use modified piRSquared's answer - you can remove transposing:

print (df.groupby(level=0,axis=1).diff(-1).dropna(1))
           Total nonfarm Total private             
date               Hires         Hires Job openings
2001-01-01         392.0         532.0       4495.0
2002-01-01        1090.0        1158.0       2291.0
2003-01-01        1147.0        1221.0       2277.0
2004-01-01        1368.0        1399.0       1785.0
2005-01-01        1537.0        1518.0       1821.0

Upvotes: 2

Related Questions