Reputation: 16478
I have MultiColumns: the second level repetitively contains Job Openings
and Hires
. I would like to subtract one from another for each of the top-level columns - but all I try gets me into index-errors or slice errors. How can I compute it?
Sample data:
>>> df.head()
Out[25]:
Total nonfarm Total private
Hires Job openings Hires Job openings
date
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
expected output:
Out[25]:
Total nonfarm Total private
difference difference
date
2001-01-01 1234 5678
2002-01-01 1234 5678
2003-01-01 1234 5678
2004-01-01 1234 5678
2005-01-01 1234 5678
where the numbers obviously are not correct.
In order to have a generally applicable way, I was trying to set up
def apply(group):
result = group.loc[:, pd.IndexSlice[:, 'Job openings']].div(group.loc[:, pd.IndexSlice[:, 'Hires']].values)
result.columns = pd.MultiIndex.from_product([[group.columns.get_level_values(0)[0]], ['Ratio']])
return result.values
foo = df.groupby(axis=1, level=0).apply(apply)
Which suffers from two problems:
.values
in order to get the divide properlyfoo
is not a proper dataframe:
Accommodation and food services [[0.76], [0.480349344978], [0.501388888889], [... Arts, entertainment, and recreation [[0.558139534884], [0.46017699115], [0.2483221... Construction [[0.35], [0.274881516588], [0.267260579065], [...
I first tried to return result
, instead of result.values
, but that just lead to a data frame full of NaN
What I don't like about the highest-voted answer is that it requires on .diff()
or .div()
- hacks, which make the code hard to read and are hard to implement when there's more than two columns at the sub-level.
Upvotes: 3
Views: 1243
Reputation: 294258
Use groupby
and apply
import pandas as pd
df = pd.DataFrame(
[
[5777, 5385, 5419, 4887],
[4849, 3759, 4539, 3381],
[4971, 3824, 4645, 3424],
[4827, 3459, 4552, 3153],
[5207, 3670, 4876, 3358],
],
index=pd.to_datetime(['2001-01-01',
'2002-01-01',
'2003-01-01',
'2004-01-01',
'2005-01-01']),
columns=pd.MultiIndex.from_tuples(
[('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
('Total private', 'Hires'), ('Total private', 'Job Openings')]
)
)
print df
def diff(group):
g = group.shift().sub(group).dropna()
g.index = ['Difference']
return g
def ratio(group):
g = group.shift().div(group).dropna()
g.index = ['Ratio']
return g
def do_nothing(group):
return group
pd.concat(
[df.T.groupby(level=0).apply(f).T for f in [diff, ratio, do_nothing]],
axis=1
).sort_index(axis=1)
Total nonfarm Total private \
Difference Hires Job Openings Ratio Difference Hires
2001-01-01 392.0 5777 5385 1.07 532.0 5419
2002-01-01 1090.0 4849 3759 1.29 1158.0 4539
2003-01-01 1147.0 4971 3824 1.30 1221.0 4645
2004-01-01 1368.0 4827 3459 1.40 1399.0 4552
2005-01-01 1537.0 5207 3670 1.42 1518.0 4876
Job Openings Ratio
2001-01-01 4887 1.11
2002-01-01 3381 1.34
2003-01-01 3424 1.36
2004-01-01 3153 1.44
2005-01-01 3358 1.45
Upvotes: 1
Reputation: 294258
Another way to go about this is to swap the column levels and use the column accessors.
import pandas as pd
df = pd.DataFrame(
[
[5777, 5385, 5419, 4887],
[4849, 3759, 4539, 3381],
[4971, 3824, 4645, 3424],
[4827, 3459, 4552, 3153],
[5207, 3670, 4876, 3358],
],
index=pd.to_datetime(['2001-01-01',
'2002-01-01',
'2003-01-01',
'2004-01-01',
'2005-01-01']),
columns=pd.MultiIndex.from_tuples(
[('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
('Total private', 'Hires'), ('Total private', 'Job Openings')]
)
)
print df
Total nonfarm Total private
Hires Job Openings Hires Job Openings
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
If we swap level then sort, it looks like:
print df.swaplevel(0, 1, 1).sort_index(axis=1)
Hires Job Openings
Total nonfarm Total private Total nonfarm Total private
2001-01-01 5777 5419 5385 4887
2002-01-01 4849 4539 3759 3381
2003-01-01 4971 4645 3824 3424
2004-01-01 4827 4552 3459 3153
2005-01-01 5207 4876 3670 3358
With this we can access the Hires with .Hires
or ['Hires']
. Combining this with your need to subtract:
print df.swaplevel(0, 1, 1)['Hires']
Total nonfarm Total private
2001-01-01 5777 5419
2002-01-01 4849 4539
2003-01-01 4971 4645
2004-01-01 4827 4552
2005-01-01 5207 4876
print df.swaplevel(0, 1, 1)['Hires'] - df.swaplevel(0, 1, 1)['Job Openings']
Total nonfarm Total private
2001-01-01 392 532
2002-01-01 1090 1158
2003-01-01 1147 1221
2004-01-01 1368 1399
2005-01-01 1537 1518
Putting it all together with a little extra, I did:
df_ = df.swaplevel(0, 1, 1)
_df = pd.concat([
df_,
pd.concat([df_['Hires'] - df_['Job Openings'], df_['Hires'] / df_['Job Openings']],
axis=1, keys=['Difference', 'Ratio'])
], axis=1)
df = _df.swaplevel(0, 1, 1).sort_index(axis=1)
print df
Total nonfarm Total private \
Difference Hires Job Openings Ratio Difference Hires
2001-01-01 392 5777 5385 1.072795 532 5419
2002-01-01 1090 4849 3759 1.289971 1158 4539
2003-01-01 1147 4971 3824 1.299948 1221 4645
2004-01-01 1368 4827 3459 1.395490 1399 4552
2005-01-01 1537 5207 3670 1.418801 1518 4876
Job Openings Ratio
2001-01-01 4887 1.108860
2002-01-01 3381 1.342502
2003-01-01 3424 1.356600
2004-01-01 3153 1.443704
2005-01-01 3358 1.452055
You could also use xs
to grab cross sections.
kw = dict(axis=1, level=1)
df.xs('Hires', **kw) - df.xs('Job Openings', **kw)
Total nonfarm Total private
2001-01-01 392 532
2002-01-01 1090 1158
2003-01-01 1147 1221
2004-01-01 1368 1399
2005-01-01 1537 1518
Upvotes: 1
Reputation: 294258
import pandas as pd
df = pd.DataFrame(
[
[5777, 5385, 5419, 4887],
[4849, 3759, 4539, 3381],
[4971, 3824, 4645, 3424],
[4827, 3459, 4552, 3153],
[5207, 3670, 4876, 3358],
],
index=pd.to_datetime(['2001-01-01',
'2002-01-01',
'2003-01-01',
'2004-01-01',
'2005-01-01']),
columns=pd.MultiIndex.from_tuples(
[('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
('Total private', 'Hires'), ('Total private', 'Job Openings')]
)
)
print df
Total nonfarm Total private
Hires Job Openings Hires Job Openings
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
Try:
df.T.groupby(level=0).diff(-1).dropna().T
Total nonfarm Total private
Hires Hires
2001-01-01 392.0 532.0
2002-01-01 1090.0 1158.0
2003-01-01 1147.0 1221.0
2004-01-01 1368.0 1399.0
2005-01-01 1537.0 1518.0
To apply other transforms, say a ratio, you could do:
print df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T
Total nonfarm Total private
Hires Hires
2001-01-01 1.072795 1.108860
2002-01-01 1.289971 1.342502
2003-01-01 1.299948 1.356600
2004-01-01 1.395490 1.443704
2005-01-01 1.418801 1.452055
Or:
print df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T
Total nonfarm Total private
Hires Hires
2001-01-01 1.072795 1.108860
2002-01-01 1.289971 1.342502
2003-01-01 1.299948 1.356600
2004-01-01 1.395490 1.443704
2005-01-01 1.418801 1.452055
To rename columns and combine with the original dataframe you can:
df2 = df.T.groupby(level=0).diff(-1).dropna().T
df2.columns = pd.MultiIndex.from_tuples(
[('Total nonfarm', 'difference'),
('Total private', 'difference')])
pd.concat([df, df2], axis=1).sort_index(axis=1)
Looks like:
Total nonfarm Total private \
Hires Job Openings difference Hires Job Openings
2001-01-01 5777 5385 392.0 5419 4887
2002-01-01 4849 3759 1090.0 4539 3381
2003-01-01 4971 3824 1147.0 4645 3424
2004-01-01 4827 3459 1368.0 4552 3153
2005-01-01 5207 3670 1537.0 4876 3358
difference
2001-01-01 532.0
2002-01-01 1158.0
2003-01-01 1221.0
2004-01-01 1399.0
2005-01-01 1518.0
Upvotes: 3
Reputation: 1289
Let's keep it simple.
In [19]: df['Total nonfarm'] - df['Total private']
Out[19]:
Hires Job Openings
2001-01-01 358 498
2002-01-01 310 378
2003-01-01 326 400
2004-01-01 275 306
2005-01-01 331 312
Upvotes: 1
Reputation: 862591
I think you can use IndexSlice:
idx = pd.IndexSlice
df[('Total private','difference')] = (df.loc[:, idx[('Total nonfarm', 'Hires')]] -
df.loc[:, idx[('Total private', 'Hires')]])
print (df)
Total nonfarm Total private
date Hires Job openings Hires Job openings difference
2001-01-01 5777 5385 5419 4887 358
2002-01-01 4849 3759 4539 3381 310
2003-01-01 4971 3824 4645 3424 326
2004-01-01 4827 3459 4552 3153 275
2005-01-01 5207 3670 4876 3358 331
If you want multiple columns, you can use modified piRSquared's answer - you can remove transposing:
print (df.groupby(level=0,axis=1).diff(-1).dropna(1))
Total nonfarm Total private
date Hires Hires Job openings
2001-01-01 392.0 532.0 4495.0
2002-01-01 1090.0 1158.0 2291.0
2003-01-01 1147.0 1221.0 2277.0
2004-01-01 1368.0 1399.0 1785.0
2005-01-01 1537.0 1518.0 1821.0
Upvotes: 2