BoBoMann
BoBoMann

Reputation: 101

Follow up - Creating new columns based on value from another column in pandas

Follow up from my previous question -- Creating new columns based on value from another column in pandas

My goal now is :

Code    Name        Level1    Level1Name    Level2  Level2Name  Level3  Level3Name
0   A   USA             A       USA             
1   AM  Massachusetts   A       USA          AM     Massachusetts   
2   AMB Boston          A       USA          AM     Massachusetts   AMB     Boston
3   AMS Springfield     A       USA          AM     Massachusetts   AMS     Springfiled
4   D   Germany         D   Germany          
5   DB  Brandenburg     D   Germany          DB     Brandenburg     
6   DBB     Berlin      D   Germany          DB     Brandenburg     DBB     Berlin
7   DBD     Dresden     D   Germany          DB     Brandenburg     DBD     Dresden

Building on Scott Boston's code, I have so far:

match   0   1   2
0       A   A   A
1       A   AM  AM
2       A   AM  AMB
3       A   AM  AMS
4       D   D   D
5       D   DB  DB
6       D   DB  DBB
7       D   DB  DBD

My approach is to loop through each column and remove row that doesn't have the same length as the rest of the values in that column but cannot seem to figure out the logic.

Sample Code:

df = pd.read_excel(r'/Users/BoBoMann/Desktop/Sequence.xlsx')

df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match').fillna('').cumsum(axis=1)
df_level

Thank you for your help!

Upvotes: 2

Views: 105

Answers (3)

Yuan Ren
Yuan Ren

Reputation: 295

I took a different approach: loop through the length of code, assuming you will not have too many levels.

import pandas as pd
df=pd.DataFrame({
    'Code':['A','AM','AMB'],
    'Name':['USA','Massachusetts',"Boston"]
})
# prepare
res=pd.DataFrame({
    'Code':[]
})
df['len']=df['Code'].str.len()
cols=[]
for x in range(df['len'].max()):
    dfX=df[df['len']==x+1].copy()
    dfX['prefix']=dfX['Code'].str.slice(stop=x)

    dfX=dfX.merge(res,how='left',left_on='prefix',right_on='Code')

    dfX[f'Level{x+1}']=dfX['Code_x']
    dfX[f'Level{x+1}Name']=dfX['Name']
    dfX[f'Code']=dfX['Code_x']
    cols+=[f'Level{x+1}',f'Level{x+1}Name']
    res=res.append(dfX[['Code']+cols],sort=False)

res

Code    Level1  Level1Name  Level2  Level2Name  Level3  Level3Name
0   A   A   USA NaN NaN NaN NaN
0   AM  A   USA AM  Massachusetts   NaN NaN
0   AMB A   USA AM  Massachusetts   AMB Boston

the idea is first add level 1 to the lookup table; then lv 2 and 3... code looks ugly, but hopefully easy to follow.

Upvotes: 0

Petar Luketina
Petar Luketina

Reputation: 449

This approach uses apply and functions:

import pandas as pd
l = ['A', 'AM', 'AMB', 'AMS', 'D', 'DB', 'DBB', 'DBD']
df = pd.DataFrame(l).rename(columns={0:'code'})

def level2(col):
  if len(col) == 1:
    return ''
  elif len(col) >= 2:
    return col[:2]

def level3(col):
  if len(col) <= 2:
    return ''
  elif len(col) > 2:
    return col[:3]

df['Level1'] = df['code'].apply(lambda col: col[0])
df['Level2'] = df['code'].apply(level2)
df['Level3'] = df['code'].apply(level3)

print(df)

Output:

  code Level1 Level2 Level3
0    A      A              
1   AM      A     AM       
2  AMB      A     AM    AMB
3  AMS      A     AM    AMS
4    D      D              
5   DB      D     DB       
6  DBB      D     DB    DBB
7  DBD      D     DB    DBD

The functions can be refactored into one function too, but you get the gist. I would recommend using apply over other methods of pandas because apply is easier to remember and to customize. Hope this helps.

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153560

Let's try:

df['Codes'] = [[*i] for i in df['Code']]
df_level = df['Code'].str.extractall('(.)')[0].unstack('match', fill_value='')
df_level = df_level.cumsum(axis=1).mask(df_level == '')
s_map = df.explode('Codes').drop_duplicates('Code', keep='last').set_index('Code')['Name']
df_level.columns = [f'Level{i+1}' for i in df_level.columns]
df_level_names =  pd.concat([df_level[i].map(s_map) for i in df_level.columns], 
                            axis=1, 
                            keys=df_level.columns+'Name')
df_out = df.join([df_level, df_level_names]).drop('Codes', axis=1)
df_out

Output:

  Code           Name Level1 Level2 Level3 Level1Name     Level2Name   Level3Name
0    A            USA      A    NaN    NaN        USA            NaN          NaN
1   AM  Massachusetts      A     AM    NaN        USA  Massachusetts          NaN
2  AMB         Boston      A     AM    AMB        USA  Massachusetts       Boston
3  AMS    Springfield      A     AM    AMS        USA  Massachusetts  Springfield
4    D        Germany      D    NaN    NaN    Germany            NaN          NaN
5   DB    Brandenburg      D     DB    NaN    Germany    Brandenburg          NaN
6  DBB         Berlin      D     DB    DBB    Germany    Brandenburg       Berlin
7  DBD        Dresden      D     DB    DBD    Germany    Brandenburg      Dresden

Upvotes: 1

Related Questions