Reputation: 101
Follow up from my previous question -- Creating new columns based on value from another column in pandas
My goal now is :
Code Name Level1 Level1Name Level2 Level2Name Level3 Level3Name
0 A USA A USA
1 AM Massachusetts A USA AM Massachusetts
2 AMB Boston A USA AM Massachusetts AMB Boston
3 AMS Springfield A USA AM Massachusetts AMS Springfiled
4 D Germany D Germany
5 DB Brandenburg D Germany DB Brandenburg
6 DBB Berlin D Germany DB Brandenburg DBB Berlin
7 DBD Dresden D Germany DB Brandenburg DBD Dresden
Building on Scott Boston's code, I have so far:
match 0 1 2
0 A A A
1 A AM AM
2 A AM AMB
3 A AM AMS
4 D D D
5 D DB DB
6 D DB DBB
7 D DB DBD
My approach is to loop through each column and remove row that doesn't have the same length as the rest of the values in that column but cannot seem to figure out the logic.
Sample Code:
df = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx')
df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match').fillna('').cumsum(axis=1)
df_level
Thank you for your help!
Upvotes: 2
Views: 105
Reputation: 295
I took a different approach: loop through the length of code, assuming you will not have too many levels.
import pandas as pd
df=pd.DataFrame({
'Code':['A','AM','AMB'],
'Name':['USA','Massachusetts',"Boston"]
})
# prepare
res=pd.DataFrame({
'Code':[]
})
df['len']=df['Code'].str.len()
cols=[]
for x in range(df['len'].max()):
dfX=df[df['len']==x+1].copy()
dfX['prefix']=dfX['Code'].str.slice(stop=x)
dfX=dfX.merge(res,how='left',left_on='prefix',right_on='Code')
dfX[f'Level{x+1}']=dfX['Code_x']
dfX[f'Level{x+1}Name']=dfX['Name']
dfX[f'Code']=dfX['Code_x']
cols+=[f'Level{x+1}',f'Level{x+1}Name']
res=res.append(dfX[['Code']+cols],sort=False)
res
Code Level1 Level1Name Level2 Level2Name Level3 Level3Name
0 A A USA NaN NaN NaN NaN
0 AM A USA AM Massachusetts NaN NaN
0 AMB A USA AM Massachusetts AMB Boston
the idea is first add level 1 to the lookup table; then lv 2 and 3... code looks ugly, but hopefully easy to follow.
Upvotes: 0
Reputation: 449
This approach uses apply
and functions:
import pandas as pd
l = ['A', 'AM', 'AMB', 'AMS', 'D', 'DB', 'DBB', 'DBD']
df = pd.DataFrame(l).rename(columns={0:'code'})
def level2(col):
if len(col) == 1:
return ''
elif len(col) >= 2:
return col[:2]
def level3(col):
if len(col) <= 2:
return ''
elif len(col) > 2:
return col[:3]
df['Level1'] = df['code'].apply(lambda col: col[0])
df['Level2'] = df['code'].apply(level2)
df['Level3'] = df['code'].apply(level3)
print(df)
Output:
code Level1 Level2 Level3
0 A A
1 AM A AM
2 AMB A AM AMB
3 AMS A AM AMS
4 D D
5 DB D DB
6 DBB D DB DBB
7 DBD D DB DBD
The functions can be refactored into one function too, but you get the gist. I would recommend using apply
over other methods of pandas because apply
is easier to remember and to customize. Hope this helps.
Upvotes: 0
Reputation: 153560
Let's try:
df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match', fill_value='')
df_level = df_level.cumsum(axis=1).mask(df_level == '')
s_map = df.explode('Codes').drop_duplicates('Code', keep='last').set_index('Code')['Name']
df_level.columns = [f'Level{i+1}' for i in df_level.columns]
df_level_names = pd.concat([df_level[i].map(s_map) for i in df_level.columns],
axis=1,
keys=df_level.columns+'Name')
df_out = df.join([df_level, df_level_names]).drop('Codes', axis=1)
df_out
Output:
Code Name Level1 Level2 Level3 Level1Name Level2Name Level3Name
0 A USA A NaN NaN USA NaN NaN
1 AM Massachusetts A AM NaN USA Massachusetts NaN
2 AMB Boston A AM AMB USA Massachusetts Boston
3 AMS Springfield A AM AMS USA Massachusetts Springfield
4 D Germany D NaN NaN Germany NaN NaN
5 DB Brandenburg D DB NaN Germany Brandenburg NaN
6 DBB Berlin D DB DBB Germany Brandenburg Berlin
7 DBD Dresden D DB DBD Germany Brandenburg Dresden
Upvotes: 1