TeoK
TeoK

Reputation: 501

split the column with list into multiple columns in dataframe

I know there have been many questions on this topic, but still:
My input: as dataframe

 task                                            m_label
0  S101-10061  [Cecum Landmark, ICV, Comment, Appendiceal ori...
1  S101-10069  [Rectum RF, ICV, Cecum Landmark, TI, Comment, ...
2  S101-10078  [Appendiceal orifice, ICV, Cecum Landmark, Com...
3  S101-10088  [Cecum Landmark, ICV, Comment, Appendiceal ori...
4  S101-10100  [Transverse, Appendiceal orifice, ICV, Cecum L...
5  S101-10102  [Rectum RF, ICV, Cecum Landmark, Comment, TI, ...
6  S101-10133  [Rectum RF, Transverse, ICV, Cecum Landmark, C...
7  S101YGBgZ2                                          [Comment]

I wan to split like df.m_label.str.split("",expand=True) but it return NaN Maybe problem with df? I get its from panda Series from: m_lab_task=data.groupby(['task'])['m_label'].unique(). So maybe in previously step it possible correct?

Required output:

      task       m_label1 m_label2 m_label3 m_label4 m_label5 m_label6
0   S101-10061  Cecum Landmark ICV Comment Appendiceal orifice
1   S101-10069  Rectum RF ICV Cecum Landmark TI Comment Transverse
2   S101-10078  Appendiceal orifice ICV Cecum Landmark Comment Transverse
 Rectum RF
   

Upvotes: 0

Views: 86

Answers (4)

Rinshan Kolayil
Rinshan Kolayil

Reputation: 1139

From Mr.Brarath Narayan code, I hope you can short the code as follows without use of numpy

df = pd.concat([df['task'],df['m_label'].apply(pd.Series).add_prefix("m_label").fillna("")], axis = 1)

Output

task    m_label0    m_label1    m_label2    m_label3    m_label4    m_label5    m_label6
0   S101-10061  Cecum Landmark  ICV Comment Appendiceal orifice         
1   S101-10069  Rectum RF   ICV Cecum Landmark  TI  Comment Transverse  
2   S101-10078  Appendiceal orifice ICV Cecum Landmark  Comment Transverse  Rectum RF   
3   S101-10088  Cecum Landmark  ICV Comment Appendiceal orifice         
4   S101-10100  Transverse  Appendiceal orifice ICV Cecum Landmark  Comment     
5   S101-10102  Rectum RF   ICV Cecum Landmark  Comment TI  Transverse  Appendiceal orifice
6   S101-10133  Rectum RF   Transverse  ICV Cecum Landmark  Comment     
7   S101YGBgZ2  Comment                     

Upvotes: 0

Bharat Natrayn
Bharat Natrayn

Reputation: 130

when you convert the list into dataframe string data without separation will combine as single data to overcome this you have insert the comma before convert into dataframe like this,

import pandas as pd
data={"task":["S101-10061","S101-10069","S101-10078","S101-10088","S101-10100","S101-10102","S101-10133","S101YGBgZ2"],
     "m_label":[['Cecum Landmark','ICV' ,'Comment' ,'Appendiceal orifice'],['Rectum RF','ICV','Cecum Landmark','TI','Comment','Transverse']
               ,['Appendiceal orifice' ,'ICV' ,'Cecum Landmark', 'Comment', 'Transverse','Rectum RF'],['Cecum Landmark', 'ICV', 'Comment', 'Appendiceal orifice'],
               ['Transverse' ,'Appendiceal orifice', 'ICV', 'Cecum Landmark', 'Comment'],['Rectum RF' ,'ICV' ,'Cecum Landmark', 'Comment' ,'TI' ,'Transverse','Appendiceal orifice'],
               ['Rectum RF', 'Transverse' ,'ICV' ,'Cecum Landmark', 'Comment'],['Comment']]}
data=pd.DataFrame(data)

dataframe should like this

        task    m_label
0   S101-10061  [Cecum Landmark, ICV, Comment, Appendiceal ori...
1   S101-10069  [Rectum RF, ICV, Cecum Landmark, TI, Comment, ...
2   S101-10078  [Appendiceal orifice, ICV, Cecum Landmark, Com...
3   S101-10088  [Cecum Landmark, ICV, Comment, Appendiceal ori...
4   S101-10100  [Transverse, Appendiceal orifice, ICV, Cecum L...
5   S101-10102  [Rectum RF, ICV, Cecum Landmark, Comment, TI, ...
6   S101-10133  [Rectum RF, Transverse, ICV, Cecum Landmark, C...
7   S101YGBgZ2  [Comment]

output code

import numpy as np
data=pd.concat([data["task"],data["m_label"].apply(lambda x:pd.Series(x).add_prefix("m_label"))],axis=1).replace(np.nan," ")

task    m_label0    m_label1    m_label2    m_label3    m_label4    m_label5    m_label6
0   S101-10061  Cecum Landmark  ICV Comment Appendiceal orifice         
1   S101-10069  Rectum RF   ICV Cecum Landmark  TI  Comment Transverse  
2   S101-10078  Appendiceal orifice ICV Cecum Landmark  Comment Transverse  Rectum RF   
3   S101-10088  Cecum Landmark  ICV Comment Appendiceal orifice         
4   S101-10100  Transverse  Appendiceal orifice ICV Cecum Landmark  Comment     
5   S101-10102  Rectum RF   ICV Cecum Landmark  Comment TI  Transverse  Appendiceal orifice
6   S101-10133  Rectum RF   Transverse  ICV Cecum Landmark  Comment     
7   S101YGBgZ2  Comment     

Upvotes: 1

ThePyGuy
ThePyGuy

Reputation: 18416

Use str.findall and pass the regex to capture everything enclosed by single quite '', then apply pd.Series to convert them to columns

df=df.set_index('task')['m_label'].str.findall('\'(.*?)\'').apply(pd.Series)
df.columns = [f'm_label{i+1}' for i in df]

OUTPUT:

                       m_label1             m_label2        m_label3               m_label4    m_label5    m_label6             m_label7  
task                                                                                                                                       
S101-10061       Cecum Landmark                  ICV         Comment    Appendiceal orifice         NaN         NaN                  NaN   
S101-10069            Rectum RF                  ICV  Cecum Landmark                     TI     Comment  Transverse                  NaN   
S101-10078  Appendiceal orifice                  ICV  Cecum Landmark                Comment  Transverse   Rectum RF                  NaN   
S101-10088       Cecum Landmark                  ICV         Comment    Appendiceal orifice         NaN         NaN                  NaN   
S101-10100           Transverse  Appendiceal orifice             ICV         Cecum Landmark     Comment         NaN                  NaN   
S101-10102            Rectum RF                  ICV  Cecum Landmark                Comment          TI  Transverse  Appendiceal orifice   
S101-10133            Rectum RF           Transverse             ICV         Cecum Landmark     Comment         NaN                  NaN   
S101YGBgZ2              Comment                  NaN             NaN                    NaN         NaN         NaN                  NaN   
                

If needed, you can reset the index later, and fillna('').

Upvotes: 2

BlackMath
BlackMath

Reputation: 1858

Just to add something to ThePyGuy's answer,if you want to rename columns "on the fly", you can use add_prefix().

df.set_index('task')['m_label'].str.findall('\'(.*?)\'').apply(pd.Series).add_prefix('m_label')

output:

Out[27]: 
                  m_label0 m_label1  ... m_label4    m_label5
task                                 ...                     
S101-10061  Cecum Landmark      ICV  ...      NaN         NaN
S101-10069       Rectum RF      ICV  ...  Comment  Transverse

Upvotes: 0

Related Questions