Reputation: 511
I have a dataframe like this:
Id Column Val1 Val2
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0
2 Cust=acc hit,Data=125 3.0 400.0
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0
5 Cust=evc,Region Info=atz 2.0 NaN
I want to convert the dataframe to this:
Id Column Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123.0
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124.0
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125.0
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126.0
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127.0
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
From this other question, I got a partial answer.
But how can I handle the spaces in the key and value?
Edit: There may be multiple key-value pairs (other than the ones shown in the example). So I need to handle cases for any 'n' number of columns.
Upvotes: 0
Views: 667
Reputation: 71689
Series.str.findall
We can use str.findall
with regex capture groups to extract key-value
pairs from the Id Column
column
df.join(pd.DataFrame(map(dict, df['Id Column'].str.findall(r'([^=,]+)=([^,]+)'))))
Id Column Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
Regex
details
([^=,]+)
: first capturing group
[^=,]+
: Matches any character not present in the list [=,]
one or more times=
: Matches the =
character literally([^,]+)
: Second capturing group
[^,]+
: Matches any character not present in the list [,]
one or more timesSee the online regex demo
Upvotes: 3
Reputation: 133428
With your shown samples only, could please try following.
import pandas as pd
df[["Cust Region","Info","Data"]] = df["IdColumn"].str.extract(r'^Cust=([^,]+)(?:,Region Info=([^,]*))?(?:,Data=(.*))?$', expand=True)
df
Here is the Online demo for used regex
Output will be as follows:
IdColumn Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
Explanation: Adding detailed explanation for above regex.
^Cust= ##Checking if value starts from Cust= here.
([^,]+) ##Creating 1st capturing group which has all values till , here.
(?:,Region Info= ##Starting a non-capturing group , Region Info= here.
([^,]*) ##Creating 2nd capturing group which has all values till , here.
)? ##Closing non-capturing group here.
(?:,Data= ##Creating non-capturing group which has ,Data= here.
(.*) ##Creating 3rd capturing group which has all values till end of value here.
)?$ ##Closing non-capturing group here at the end of line.
Upvotes: 2
Reputation: 862481
Use list comprehension with split by ,
and then by =
for list of dictionaries, so possible pass to DataFrame
constructor:
L = [dict([y.split('=') for y in x.split(',')]) for x in df['Id Column']]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
Id Column Val1 Val2 Cust Region Info \
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz
5 Cust=evc,Region Info=atz 2.0 NaN evc atz
Data
0 123
1 124
2 125
3 126
4 127
5 NaN
Upvotes: 1
Reputation: 29992
Use apply()
on column Id Column
and get the value by splitting.
df['Cust Region'] = df['Id Column'].apply(lambda x: x.split(',')[0].split('=')[-1])
# print(df)
Id Column Val1 Val2 Cust Region
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd
2 Cust=acc hit,Data=125 3.0 400.0 acc hit
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss
5 Cust=evc,Region Info=atz 2.0 NaN evc
Upvotes: 1