Reputation: 160
I have some data in Excel that's already laid out in groups (it's a Statement of Cash Flows from some accounting software) that I'm reading into Pandas with the read_excel()
method. I want to create an index on the first 4 columns, but I can't figure out how to maintain the hierarchy of accounts. Here's what the hierarchy looks like:
When I try to either create a multi-index from the first 4 columns or if I do a ffill()
on them, Pandas (effectively) does this...
The highlighted cells are incorrect for what I need and they should be blank/NaN to maintain the hierarchy. When the subcategory B2 on Level 2 starts, any categories from levels 3 and 4 should not be filled.
Here's what I'm trying to achieve:
Ultimately, these are Statements of Cash Flows from multiple years that have slightly different charts of accounts, so my hope was to import them into Pandas DataFrames and merge them so the accounts all line up across the whole time period... this is just the first step in the process. I could code this into a dictionary manually, but am wondering if it's possible more simply within Pandas?
Upvotes: 2
Views: 152
Reputation: 128
I have a simple utility function for this:
def fillright(row, cols=None, text='DUMMY'):
""" If empty or NaN, fill given columns to the right with 'text'. """
filling = False
for c in cols:
if row[c] is np.nan or row[c]=='':
if filling==True:
row[c] = text
else:
filling = True
return row
levels = ['Level 1','Level 2','Level 3','Level 4']
df.apply(fillright, cols=levels, axis=1 )
Intermediate result:
Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0 A1 DUMMY DUMMY DUMMY
1 B1 DUMMY DUMMY
2 C1 DUMMY
3 D1 Value1 Value2
4 D2 Value3 Value4
5 B2 DUMMY DUMMY
6 C2 DUMMY
7 D3 Value5 Value6
Now we do a regular ffill and then get rid of the dummy text:
df[levels] = ( df[levels]
.replace('', np.nan)
.fillna( method='ffill' )
.replace('DUMMY', '')
)
Which produces the desired result.
Upvotes: 0
Reputation: 153460
Here a way by reversing columns order and cumprod
on isna
:
Given df,
df = pd.DataFrame({'Level 1':['A1']+['']*7,
'Level 2':['']+['B1']+['']*3+['B2']+['']*2,
'Level 3':['']*2+['C1']+['']*3+['C2']+[''],
'Level 4':['']*3+['D1','D2']+['']*2+['D3'],
'DataCol1':['']*3+['Value1', 'Value3']+['']*2+['Value5'],
'DataCol2':['']*3+['Value2', 'Value4']+['']*2+['Value6']})
Input dataframe:
Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0 A1
1 B1
2 C1
3 D1 Value1 Value2
4 D2 Value3 Value4
5 B2
6 C2
7 D3 Value5 Value6
Replace '' with np.nan:
df_nans = df.replace('', np.nan)
Create a boolean array, reverse the columns, cumprod on isna; this will be 1 for all NaN until the first non-NaN value then will become 0 after that. And reverse again to reorder columns.
mask_frame = df_nans.loc[:,::-1].isna().cumprod(axis=1).loc[:, ::-1].astype(bool)
print(mask_frame)
Boolean dataframe:
Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0 False True True True True True
1 False False True True True True
2 False False False True True True
3 False False False False False False
4 False False False False False False
5 False False True True True True
6 False False False True True True
7 False False False False False False
Forwards fill df_nans and mask values:
df_out = df_nans.ffill().mask(mask_frame)
Output:
Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0 A1 NaN NaN NaN NaN NaN
1 A1 B1 NaN NaN NaN NaN
2 A1 B1 C1 NaN NaN NaN
3 A1 B1 C1 D1 Value1 Value2
4 A1 B1 C1 D2 Value3 Value4
5 A1 B2 NaN NaN NaN NaN
6 A1 B2 C2 NaN NaN NaN
7 A1 B2 C2 D3 Value5 Value6
And, fillna with '',
df_out.fillna('')
Output:
Level 1 Level 2 Level 3 Level 4 DataCol1 DataCol2
0 A1
1 A1 B1
2 A1 B1 C1
3 A1 B1 C1 D1 Value1 Value2
4 A1 B1 C1 D2 Value3 Value4
5 A1 B2
6 A1 B2 C2
7 A1 B2 C2 D3 Value5 Value6
Upvotes: 1