Reputation: 1007
I have the following dataframe where I want to split the Col2 into multiple columns:
Input DataFrame:
>>> mydf= pd.DataFrame({'Col1':['AA','AB','AAC'], 'Col2':['AN||Ind(0.9)','LN||RED(8.9)','RN||RED(9.8)'], 'Col3':['log1','log2','log3']})
>>> mydf
Col1 Col2 Col3
0 AA AN||Ind(0.9) log1
1 AB LN||RED(8.9) log2
2 AAC RN||RED(9.8) log3
Desired DataFrame:
Col1 Col2 Col3 Col4 Col5
0 AA AN log1 Ind 0.9
1 AB LN log2 RED 8.9
2 AAC RN log3 RED 9.8
I started with Apply but the following will take a good few steps. Any shortcut?
mydf['Col4']=mydf['Col2'].apply(lambda x: str(x).split('||')[0])
Also little confused why the following throws a valuerror:
mydf['Col2'].str.split('||', expand=True)
ValueError: split() requires a non-empty pattern match.
Upvotes: 2
Views: 300
Reputation: 210842
Using RegEx from the great @piRSquared's solution
In [59]: regex = '(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
In [60]: mydf = mydf.join(mydf.pop('Col2').str.extract(regex, expand=True)) \
.sort_index(axis=1)
In [61]: mydf
Out[61]:
Col1 Col2 Col3 Col4 Col5
0 AA AN log1 Ind 0.9
1 AB LN log2 RED 8.9
2 AAC RN log3 RED 9.8
Upvotes: 1
Reputation: 38415
@piRSquared answer is amazing as usual and upvoted, I am just posting my approach. I just kept it very simple
mydf[['Col2', 'Col4', 'Col5']]= mydf.Col2.str.extract('(.*?)\|\|(.*?)\((.*?)\)', expand = True)
Col2 is automatically reassigned so no need to drop a column later.
Col1 Col2 Col3 Col4 Col5
0 AA AN log1 Ind 0.9
1 AB LN log2 RED 8.9
2 AAC RN log3 RED 9.8
Upvotes: 2
Reputation: 294258
You can split out the columns with str.extract
and assign
regex = '(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
mydf.assign(**mydf.Col2.str.extract(regex, expand=True).to_dict('list'))
Col1 Col2 Col3 Col4 Col5
0 AA AN log1 Ind 0.9
1 AB LN log2 RED 8.9
2 AAC RN log3 RED 9.8
Or equivalently with combine_first
regex = '(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
mydf.Col2.str.extract(regex, expand=True).combine_first(mydf)
Col1 Col2 Col3 Col4 Col5
0 AA AN log1 Ind 0.9
1 AB LN log2 RED 8.9
2 AAC RN log3 RED 9.8
explanation
This uses a regular expression to parse the Col2
values and assign column names at the same time
regex = '(?P<Col2>.*)\|{2,}(?P<Col4>.*)\((?P<Col5>.*)\)'
'(?P<Col2>.*)\|{2,}'
will grab everything up to the first double |
and call it Col2
'(?P<Col4>.*)'
grabs everything up to the parentheses and calls it Col4
'\((?P<Col5>.*)\)'
grabs everything inside the parentheses and calls it Col5
Col2
overwriting the existing Col2
or we use combine_first
where we default to the newly formulated Col2
values.Upvotes: 4