Reputation: 75080
I have a file from SAP which wasn't the prettiest one when dealing with data. So using series.str.contains()
and boolean masks I have managed to narrow down to a dataframe looking like below:
0 1
0 SUB 123
1 CAT SKU
2 CODE 1000123
3 CODE 1000234
4 SUB 456
5 CAT LIQ
6 CODE1 1000345
7 CODE1 1000534
8 CODE1 1000433
I am looking for a way where I can separate each SUB
into a new entry like below:
print(expected_df)
SUB CAT CODE CODE1
0 123 SKU 1000123.0 NaN
1 123 SKU 1000234.0 NaN
2 456 LIQ NaN 1000345.0
3 456 LIQ NaN 1000534.0
4 456 LIQ NaN 1000433.0
I just cant seem to get pass this step. However, this line:
df[0].eq('SUB').cumsum()
helps to segregate the groups and can be used as a helper series if needed.
Any help in transposing the data as shown would be really appreciated.
Thanks.
Upvotes: 3
Views: 318
Reputation: 4607
You can try of using df.pivot
followed by .ffill(),bfill() for the specific 'SUB' column group rows.
df1 = df.pivot(columns='0')
df1.columns = df1.columns.map(lambda x: x[1])
df1.SUB = df1.SUB.ffill()
df1.groupby('SUB').ffill().groupby('SUB').bfill().drop_duplicates()
#5.89 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# as time constraints, without use of lambda operation
#df1.groupby(df1.SUB.ffill()).apply(lambda x: x.ffill().bfill()).drop_duplicates()
#16 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Out:
SUB CAT CODE CODE1 SUB
2 123 SKU 1000123 NaN 123
3 123 SKU 1000234 NaN 123
6 456 LIQ NaN 1000345 456
7 456 LIQ NaN 1000534 456
8 456 LIQ NaN 1000433 456
Upvotes: 2
Reputation: 323236
IIUC
l=[y.set_index('0').T.set_index(['SUB','CAT']).stack() for x , y in df.groupby(df['0'].eq('SUB').cumsum())]
s=pd.concat(l).to_frame('v')
s.assign(key=s.groupby(level=[0,1,2]).cumcount()).set_index('key',append=True).unstack(2)
v
0 CODE CODE1
SUB CAT key
123 SKU 0 1000123 NaN
1 1000234 NaN
456 LIQ 0 NaN 1000345
1 NaN 1000534
2 NaN 1000433
Upvotes: 2
Reputation: 59274
IIUC,
df.set_index('col1').groupby(df.col1.eq('SUB').cumsum().values).apply(lambda s: pd.DataFrame({
'SUB': s.loc['SUB'].item(),
'CAT': s.loc['CAT'].item(),
s.index[2]: s.loc[s.index[2]].col2.tolist()
})).reset_index(drop=True)
Outputs
SUB CAT CODE CODE1
0 123 SKU 1000123 NaN
1 123 SKU 1000234 NaN
2 456 LIQ NaN 1000345
3 456 LIQ NaN 1000534
4 456 LIQ NaN 1000433
However, this looks like an XY problem. Maybe it's worth taking a look into how you ended up with this df
in the first place
Upvotes: 2