Grouped data to Pandas Multi-Index

Question

I have some data in Excel that's already laid out in groups (it's a Statement of Cash Flows from some accounting software) that I'm reading into Pandas with the read_excel() method. I want to create an index on the first 4 columns, but I can't figure out how to maintain the hierarchy of accounts. Here's what the hierarchy looks like:

When I try to either create a multi-index from the first 4 columns or if I do a ffill() on them, Pandas (effectively) does this...

The highlighted cells are incorrect for what I need and they should be blank/NaN to maintain the hierarchy. When the subcategory B2 on Level 2 starts, any categories from levels 3 and 4 should not be filled.

Here's what I'm trying to achieve:

Ultimately, these are Statements of Cash Flows from multiple years that have slightly different charts of accounts, so my hope was to import them into Pandas DataFrames and merge them so the accounts all line up across the whole time period... this is just the first step in the process. I could code this into a dictionary manually, but am wondering if it's possible more simply within Pandas?

Scott Boston · Accepted Answer

Here a way by reversing columns order and cumprod on isna:

Given df,

df = pd.DataFrame({'Level 1':['A1']+['']*7,
                  'Level 2':['']+['B1']+['']*3+['B2']+['']*2,
                  'Level 3':['']*2+['C1']+['']*3+['C2']+[''],
                  'Level 4':['']*3+['D1','D2']+['']*2+['D3'],
                  'DataCol1':['']*3+['Value1', 'Value3']+['']*2+['Value5'],
                  'DataCol2':['']*3+['Value2', 'Value4']+['']*2+['Value6']})

Input dataframe:

  Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0      A1                                          
1              B1                                  
2                      C1                          
3                              D1   Value1   Value2
4                              D2   Value3   Value4
5              B2                                  
6                      C2                          
7                              D3   Value5   Value6

Replace '' with np.nan:

df_nans = df.replace('', np.nan)

Create a boolean array, reverse the columns, cumprod on isna; this will be 1 for all NaN until the first non-NaN value then will become 0 after that. And reverse again to reorder columns.

mask_frame = df_nans.loc[:,::-1].isna().cumprod(axis=1).loc[:, ::-1].astype(bool)
print(mask_frame)

Boolean dataframe:

   Level 1  Level 2  Level 3  Level 4  DataCol1  DataCol2
0    False     True     True     True      True      True
1    False    False     True     True      True      True
2    False    False    False     True      True      True
3    False    False    False    False     False     False
4    False    False    False    False     False     False
5    False    False     True     True      True      True
6    False    False    False     True      True      True
7    False    False    False    False     False     False

Forwards fill df_nans and mask values:

df_out = df_nans.ffill().mask(mask_frame)

Output:

  Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0      A1     NaN     NaN     NaN      NaN      NaN
1      A1      B1     NaN     NaN      NaN      NaN
2      A1      B1      C1     NaN      NaN      NaN
3      A1      B1      C1      D1   Value1   Value2
4      A1      B1      C1      D2   Value3   Value4
5      A1      B2     NaN     NaN      NaN      NaN
6      A1      B2      C2     NaN      NaN      NaN
7      A1      B2      C2      D3   Value5   Value6

And, fillna with '',

df_out.fillna('')

Output:

      Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0      A1                                          
1      A1      B1                                  
2      A1      B1      C1                          
3      A1      B1      C1      D1   Value1   Value2
4      A1      B1      C1      D2   Value3   Value4
5      A1      B2                                  
6      A1      B2      C2                          
7      A1      B2      C2      D3   Value5   Value6

Grouped data to Pandas Multi-Index

Answers (2)

Related Questions