Asif Iqbal
Asif Iqbal

Reputation: 511

Split dataframe Column based on key value pandas

I have a dataframe like this:

                               Id Column  Val1   Val2
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0
2                  Cust=acc hit,Data=125   3.0  400.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0
5               Cust=evc,Region Info=atz   2.0    NaN

I want to convert the dataframe to this:

                               Id Column  Val1   Val2     Cust Region Info   Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123.0
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124.0
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127.0
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz    NaN

From this other question, I got a partial answer.

But how can I handle the spaces in the key and value?

Edit: There may be multiple key-value pairs (other than the ones shown in the example). So I need to handle cases for any 'n' number of columns.

Upvotes: 0

Views: 667

Answers (4)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Series.str.findall

We can use str.findall with regex capture groups to extract key-value pairs from the Id Column column

df.join(pd.DataFrame(map(dict, df['Id Column'].str.findall(r'([^=,]+)=([^,]+)'))))

                               Id Column  Val1   Val2     Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz  NaN

Regex details

  • ([^=,]+): first capturing group
    • [^=,]+ : Matches any character not present in the list [=,] one or more times
  • = : Matches the = character literally
  • ([^,]+) : Second capturing group
    • [^,]+ : Matches any character not present in the list [,] one or more times

See the online regex demo

Upvotes: 3

RavinderSingh13
RavinderSingh13

Reputation: 133428

With your shown samples only, could please try following.

import pandas as pd
df[["Cust Region","Info","Data"]] = df["IdColumn"].str.extract(r'^Cust=([^,]+)(?:,Region Info=([^,]*))?(?:,Data=(.*))?$', expand=True)
df

Here is the Online demo for used regex

Output will be as follows:

                                IdColumn  Val1   Val2 Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN         abc  xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0         abd  xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0     acc hit  NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0         abc  xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0     abg nss  xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN         evc  atz  NaN

Explanation: Adding detailed explanation for above regex.

^Cust=              ##Checking if value starts from Cust= here.
([^,]+)             ##Creating 1st capturing group which has all values till , here.
(?:,Region Info=    ##Starting a non-capturing group , Region Info= here.
  ([^,]*)           ##Creating 2nd capturing group which has all values till , here.
)?                  ##Closing non-capturing group here.
(?:,Data=           ##Creating non-capturing group which has ,Data= here.
  (.*)              ##Creating 3rd capturing group which has all values till end of value here.
)?$                 ##Closing non-capturing group here at the end of line.

Upvotes: 2

jezrael
jezrael

Reputation: 862481

Use list comprehension with split by , and then by = for list of dictionaries, so possible pass to DataFrame constructor:

L = [dict([y.split('=') for y in x.split(',')]) for x in df['Id Column']]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
                               Id Column  Val1   Val2     Cust Region Info  \
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz   
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz   
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN   
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz   
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz   
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz   

  Data  
0  123  
1  124  
2  125  
3  126  
4  127  
5  NaN  

Upvotes: 1

Ynjxsjmh
Ynjxsjmh

Reputation: 29992

Use apply() on column Id Column and get the value by splitting.

df['Cust Region'] = df['Id Column'].apply(lambda x: x.split(',')[0].split('=')[-1])
# print(df)

                               Id Column  Val1   Val2 Cust Region
0      Cust=abc,Region Info=xyz,Data=123  0.0     NaN         abc
1      Cust=abd,Region Info=xyz,Data=124  1.0   750.0         abd
2                  Cust=acc hit,Data=125  3.0   400.0     acc hit
3      Cust=abc,Region Info=xyz,Data=126  NaN   200.0         abc
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0     abg nss
5               Cust=evc,Region Info=atz  2.0     NaN         evc

Upvotes: 1

Related Questions