Igor Bukanov
Igor Bukanov

Reputation: 5152

MultiIndex and operations between particular groups in pandas and missing values

Consider the following one-column table indexed by pairs (A, B):

       Value
A  B        
a1 b1      1
   b2      3
a2 b1     10
   b2     30
a3 b1    100

For a particular value of B, like b2, I would like to get a table that lists all its values together with an extra column that gives the difference between b2 and b1 if the corresponding value for b1 or b2 exists. So the result should be:

      b2  b2-b1
a1   3.0    2.0
a2  30.0   20.0
a3   NaN    NaN
a4  40.0    NaN

where NaN in both columns for a3 tells that values for both b1 and b2 are missing while NaN for a4 tells that only b1 is missing.

Initially I tried the solution based on Calculation between groups in a Pandas multiindex dataframe :

import numpy as np
import pandas as pd
import itertools

index = pd.MultiIndex.from_tuples([
    ('a1', 'b1'), 
    ('a1', 'b2'), 
    ('a2', 'b1'),
    ('a2', 'b2'),
    ('a2', 'b3'),
    ('a3', 'b1'),
    ('a4', 'b2'),
], names=['A', 'B'])

table = pd.DataFrame({'Value': [1,3, 10, 30, 31, 100, 40]}, index=index)
tmp = pd.concat([table, table.groupby(level='A')['Value'].transform(lambda x: x.loc[:, 'b2'] - x.loc[:, 'b1'])], axis=1)

b2 = tmp.groupby(level='B').get_group('b2')
b2.index = b2.index.droplevel(level='B')

print(b2)

But that gives a table without the a3 row (column names can be fixed with renames):

    Value  Value
A               
a1      3    2.0
a2     30   20.0
a4     40    NaN

The reason for this is that it seems the transform preserves the original index and the extra NaN are not inserted into the table, right? Is it possible to fix this?

So I tried an alternative with explicit diff between columns in groups:

import numpy as np
import pandas as pd
import itertools

index = pd.MultiIndex.from_tuples([
    ('a1', 'b1'), 
    ('a1', 'b2'), 
    ('a2', 'b1'),
    ('a2', 'b2'),
    ('a3', 'b1'),
    ('a4', 'b2'),
], names=['A', 'B'])

input_table = pd.DataFrame({'Value': [1,3, 10, 30, 100, 40]}, index=index)

grouped = input_table.groupby(level='B')
b1 = grouped.get_group('b1')
b1.index = b1.index.droplevel(level='B')

b2 = grouped.get_group('b2')
b2.index = b2.index.droplevel(level='B')
b2 = b2.rename(columns={'Value': 'b2'})

b2 = pd.concat([b2, b2['b2'] - b1['Value']], axis=1)
b2.rename(columns={0: 'b2-b1'}, inplace=True)

print(b2)

This works but it seems rather complicated with all that code to drop levels so the diff between columns work and separated column renames. Is it possible to make it simpler?

Upvotes: 0

Views: 64

Answers (1)

Zero
Zero

Reputation: 76967

Using unstack to get B onto columns and assign to create b2-b1 and drop to remove unneeded columns.

In [1120]: table.unstack()['Value'].assign(b2b1=lambda x: x.b2-x.b1).drop(['b1', 'b3'], 1)
Out[1120]:
B     b2  b2b1
A
a1   3.0   2.0
a2  30.0  20.0
a3   NaN   NaN
a4  40.0   NaN

To get rid of index names use rename_axis

In [1123]: (table.unstack()['Value'].assign(b2b1=lambda x: x.b2-x.b1).drop(['b1', 'b3'], 1)
                 .rename_axis(None).rename_axis(None, 1))
Out[1123]:
      b2  b2b1
a1   3.0   2.0
a2  30.0  20.0
a3   NaN   NaN
a4  40.0   NaN

Or, you can store the unstack result

In [1127]: dff = table.unstack()['Value'].rename_axis(None).rename_axis(None, 1)

In [1128]: dff['b2-b1'] = dff['b2'] - dff['b1']

In [1129]: dff
Out[1129]:
       b1    b2    b3  b2-b1
a1    1.0   3.0   NaN    2.0
a2   10.0  30.0  31.0   20.0
a3  100.0   NaN   NaN    NaN
a4    NaN  40.0   NaN    NaN

In [1189]: dff[['b1', 'b2-b1']]  # get specific columns
Out[1189]:
       b1  b2-b1
a1    1.0    2.0
a2   10.0   20.0
a3  100.0    NaN
a4    NaN    NaN

Details

In [1124]: table.unstack()
Out[1124]:
    Value
B      b1    b2    b3
A
a1    1.0   3.0   NaN
a2   10.0  30.0  31.0
a3  100.0   NaN   NaN
a4    NaN  40.0   NaN

Upvotes: 1

Related Questions