Laz
Laz

Reputation: 113

split a series in pandas pipe separated (with equal sign) into multiple columns

I'm relatively new to Python and would like to seek your help in the following problem. My current series of information looks like this:


df['word_feats']
------------------------------------------------
0                                                   Case=Loc|Gender=Neut|Number=Sing|Person=3
1                                                   Case=Nom|Gender=Neut|Number=Sing|Person=3
2                                                                              PunctType=Comm
3                                                   Case=Nom|Gender=Neut|Number=Sing|Person=3
4                                                   Case=Nom|Gender=Neut|Number=Sing|Person=3
5                                                                                        None
6                                                   Case=Nom|Gender=Neut|Number=Sing|Person=3
7                                                   Case=Loc|Gender=Neut|Number=Sing|Person=3
8                                                            Gender=Neut|Number=Sing|Person=3
9                                                   Case=Loc|Gender=Neut|Number=Plur|Person=3
10                                                                                       None

My results in a single column are not in a dictionary format (which would be easier). Instead, it is in a pipe separated format with equals sign.

I am hoping to split this series into multiple columns, taking the 'key' (in this case, LEFT of equal sign) as the column name, and 'value' (RIGHT of equal sign) as my cell.

I've tried something along this line:

df['word_feats'].str.split('|', expand=True)

Which doesn't work as:

  1. It doesn't take the 'key' as column name
  2. My values are in the wrong columns!

Would appreciate any answers for this! Thanks.

Upvotes: 2

Views: 547

Answers (2)

Laz
Laz

Reputation: 113

I have overcome this issue by doing the following:

Step 1: Convert series to dictionary:

word_feats_dict=[]
for i in df['word_feats']:
    
    if i== None:
        q='None'
    else: 
        q = dict(item.split('=') for item in i.split('|'))
    
    word_feats_dict.append(q)

df['word_feats_dict']=word_feats_dict              # creating another series that is in dictionary form.

Step 2: Explode dictionary into Multiple columns

df['word_feats_dict'].apply(pd.Series)
df = pd.concat([df.drop(['word_feats_dict'],axis=1),df['word_feats_dict'].apply(pd.Series)],axis=1)
df.drop(columns=0,inplace=True)    # As I have 'None' as some of my values, they turned out to be a column on itself called '0'.

Upvotes: 0

Shubham Sharma
Shubham Sharma

Reputation: 71687

We can use a regular expression pattern to find all the occurrences of key-value pairs from each row, this will generate a list of tuples in every row, then map each list to dict and construct a dataframe from the mapped records

pd.DataFrame(map(dict, df['word_feats'].str.findall(r'([^|=]+)=([^|]+)')))

   Case Gender Number Person PunctType
0   Loc   Neut   Sing      3       NaN
1   Nom   Neut   Sing      3       NaN
2   NaN    NaN    NaN    NaN      Comm
3   Nom   Neut   Sing      3       NaN
4   Nom   Neut   Sing      3       NaN
5   NaN    NaN    NaN    NaN       NaN
6   Nom   Neut   Sing      3       NaN
7   Loc   Neut   Sing      3       NaN
8   NaN   Neut   Sing      3       NaN
9   Loc   Neut   Plur      3       NaN
10  NaN    NaN    NaN    NaN       NaN

Upvotes: 3

Related Questions