Reputation: 5152
Consider the following one-column table indexed by pairs (A, B):
Value
A B
a1 b1 1
b2 3
a2 b1 10
b2 30
a3 b1 100
For a particular value of B, like b2, I would like to get a table that lists all its values together with an extra column that gives the difference between b2 and b1 if the corresponding value for b1 or b2 exists. So the result should be:
b2 b2-b1
a1 3.0 2.0
a2 30.0 20.0
a3 NaN NaN
a4 40.0 NaN
where NaN in both columns for a3 tells that values for both b1 and b2 are missing while NaN for a4 tells that only b1 is missing.
Initially I tried the solution based on Calculation between groups in a Pandas multiindex dataframe :
import numpy as np
import pandas as pd
import itertools
index = pd.MultiIndex.from_tuples([
('a1', 'b1'),
('a1', 'b2'),
('a2', 'b1'),
('a2', 'b2'),
('a2', 'b3'),
('a3', 'b1'),
('a4', 'b2'),
], names=['A', 'B'])
table = pd.DataFrame({'Value': [1,3, 10, 30, 31, 100, 40]}, index=index)
tmp = pd.concat([table, table.groupby(level='A')['Value'].transform(lambda x: x.loc[:, 'b2'] - x.loc[:, 'b1'])], axis=1)
b2 = tmp.groupby(level='B').get_group('b2')
b2.index = b2.index.droplevel(level='B')
print(b2)
But that gives a table without the a3 row (column names can be fixed with renames):
Value Value
A
a1 3 2.0
a2 30 20.0
a4 40 NaN
The reason for this is that it seems the transform preserves the original index and the extra NaN are not inserted into the table, right? Is it possible to fix this?
So I tried an alternative with explicit diff between columns in groups:
import numpy as np
import pandas as pd
import itertools
index = pd.MultiIndex.from_tuples([
('a1', 'b1'),
('a1', 'b2'),
('a2', 'b1'),
('a2', 'b2'),
('a3', 'b1'),
('a4', 'b2'),
], names=['A', 'B'])
input_table = pd.DataFrame({'Value': [1,3, 10, 30, 100, 40]}, index=index)
grouped = input_table.groupby(level='B')
b1 = grouped.get_group('b1')
b1.index = b1.index.droplevel(level='B')
b2 = grouped.get_group('b2')
b2.index = b2.index.droplevel(level='B')
b2 = b2.rename(columns={'Value': 'b2'})
b2 = pd.concat([b2, b2['b2'] - b1['Value']], axis=1)
b2.rename(columns={0: 'b2-b1'}, inplace=True)
print(b2)
This works but it seems rather complicated with all that code to drop levels so the diff between columns work and separated column renames. Is it possible to make it simpler?
Upvotes: 0
Views: 64
Reputation: 76967
Using unstack
to get B
onto columns and assign
to create b2-b1
and drop
to remove unneeded columns.
In [1120]: table.unstack()['Value'].assign(b2b1=lambda x: x.b2-x.b1).drop(['b1', 'b3'], 1)
Out[1120]:
B b2 b2b1
A
a1 3.0 2.0
a2 30.0 20.0
a3 NaN NaN
a4 40.0 NaN
To get rid of index names use rename_axis
In [1123]: (table.unstack()['Value'].assign(b2b1=lambda x: x.b2-x.b1).drop(['b1', 'b3'], 1)
.rename_axis(None).rename_axis(None, 1))
Out[1123]:
b2 b2b1
a1 3.0 2.0
a2 30.0 20.0
a3 NaN NaN
a4 40.0 NaN
Or, you can store the unstack
result
In [1127]: dff = table.unstack()['Value'].rename_axis(None).rename_axis(None, 1)
In [1128]: dff['b2-b1'] = dff['b2'] - dff['b1']
In [1129]: dff
Out[1129]:
b1 b2 b3 b2-b1
a1 1.0 3.0 NaN 2.0
a2 10.0 30.0 31.0 20.0
a3 100.0 NaN NaN NaN
a4 NaN 40.0 NaN NaN
In [1189]: dff[['b1', 'b2-b1']] # get specific columns
Out[1189]:
b1 b2-b1
a1 1.0 2.0
a2 10.0 20.0
a3 100.0 NaN
a4 NaN NaN
Details
In [1124]: table.unstack()
Out[1124]:
Value
B b1 b2 b3
A
a1 1.0 3.0 NaN
a2 10.0 30.0 31.0
a3 100.0 NaN NaN
a4 NaN 40.0 NaN
Upvotes: 1