CatamountJack
CatamountJack

Reputation: 160

Grouped data to Pandas Multi-Index

I have some data in Excel that's already laid out in groups (it's a Statement of Cash Flows from some accounting software) that I'm reading into Pandas with the read_excel() method. I want to create an index on the first 4 columns, but I can't figure out how to maintain the hierarchy of accounts. Here's what the hierarchy looks like:

Data in Excel

When I try to either create a multi-index from the first 4 columns or if I do a ffill() on them, Pandas (effectively) does this...

Incorrect Multi-Index or ffill() result

The highlighted cells are incorrect for what I need and they should be blank/NaN to maintain the hierarchy. When the subcategory B2 on Level 2 starts, any categories from levels 3 and 4 should not be filled.

Here's what I'm trying to achieve:

Expected result

Ultimately, these are Statements of Cash Flows from multiple years that have slightly different charts of accounts, so my hope was to import them into Pandas DataFrames and merge them so the accounts all line up across the whole time period... this is just the first step in the process. I could code this into a dictionary manually, but am wondering if it's possible more simply within Pandas?

Upvotes: 2

Views: 152

Answers (2)

kleynjan
kleynjan

Reputation: 128

I have a simple utility function for this:

def fillright(row, cols=None, text='DUMMY'):
    """ If empty or NaN, fill given columns to the right with 'text'. """ 
    filling = False
    for c in cols:
        if row[c] is np.nan or row[c]=='':
            if filling==True:
                row[c] = text
        else:
            filling = True
    return row

levels = ['Level 1','Level 2','Level 3','Level 4']
df.apply(fillright, cols=levels, axis=1 )

Intermediate result:

  Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0      A1   DUMMY   DUMMY   DUMMY                  
1              B1   DUMMY   DUMMY                  
2                      C1   DUMMY                  
3                              D1   Value1   Value2
4                              D2   Value3   Value4
5              B2   DUMMY   DUMMY                  
6                      C2   DUMMY                  
7                              D3   Value5   Value6

Now we do a regular ffill and then get rid of the dummy text:

df[levels] = ( df[levels]
    .replace('', np.nan)
    .fillna( method='ffill' )
    .replace('DUMMY', '')
)

Which produces the desired result.

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

Here a way by reversing columns order and cumprod on isna:

Given df,

df = pd.DataFrame({'Level 1':['A1']+['']*7,
                  'Level 2':['']+['B1']+['']*3+['B2']+['']*2,
                  'Level 3':['']*2+['C1']+['']*3+['C2']+[''],
                  'Level 4':['']*3+['D1','D2']+['']*2+['D3'],
                  'DataCol1':['']*3+['Value1', 'Value3']+['']*2+['Value5'],
                  'DataCol2':['']*3+['Value2', 'Value4']+['']*2+['Value6']})

Input dataframe:

  Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0      A1                                          
1              B1                                  
2                      C1                          
3                              D1   Value1   Value2
4                              D2   Value3   Value4
5              B2                                  
6                      C2                          
7                              D3   Value5   Value6

Replace '' with np.nan:

df_nans = df.replace('', np.nan)

Create a boolean array, reverse the columns, cumprod on isna; this will be 1 for all NaN until the first non-NaN value then will become 0 after that. And reverse again to reorder columns.

mask_frame = df_nans.loc[:,::-1].isna().cumprod(axis=1).loc[:, ::-1].astype(bool)
print(mask_frame)

Boolean dataframe:

   Level 1  Level 2  Level 3  Level 4  DataCol1  DataCol2
0    False     True     True     True      True      True
1    False    False     True     True      True      True
2    False    False    False     True      True      True
3    False    False    False    False     False     False
4    False    False    False    False     False     False
5    False    False     True     True      True      True
6    False    False    False     True      True      True
7    False    False    False    False     False     False

Forwards fill df_nans and mask values:

df_out = df_nans.ffill().mask(mask_frame)

Output:

  Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0      A1     NaN     NaN     NaN      NaN      NaN
1      A1      B1     NaN     NaN      NaN      NaN
2      A1      B1      C1     NaN      NaN      NaN
3      A1      B1      C1      D1   Value1   Value2
4      A1      B1      C1      D2   Value3   Value4
5      A1      B2     NaN     NaN      NaN      NaN
6      A1      B2      C2     NaN      NaN      NaN
7      A1      B2      C2      D3   Value5   Value6

And, fillna with '',

df_out.fillna('')

Output:

      Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0      A1                                          
1      A1      B1                                  
2      A1      B1      C1                          
3      A1      B1      C1      D1   Value1   Value2
4      A1      B1      C1      D2   Value3   Value4
5      A1      B2                                  
6      A1      B2      C2                          
7      A1      B2      C2      D3   Value5   Value6

Upvotes: 1

Related Questions