Reputation: 181
I have these two strings:
x = 'plasma_glucose_concentration183.0000'
y = 'Participants20-30'
I want to split the strings as follows:
x: ['plasma_glucose_concentration', '183.0000']
y: ['Participants', '20-30']
I created this function, but only the first string is split correctly:
import re

def split_string(x):
    res = re.findall(r"(\w+?)(\d*\.\d+|\d+)", x)
    return res
When I split the second string I get:
[('Participants', '20'), ('3', '0')]
Is there a regex solution for this? Thx.
Upvotes: 2
Views: 94
Reputation: 627103
You can use
import re
x = ['plasma_glucose_concentration183.0000', 'Participants20-30','2_hour_serum_insulin543.0000']
for s in x:
    print(re.split(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])', s))
# => ['plasma_glucose_concentration', '183.0000']
# ['Participants', '20-30']
# ['2_hour_serum_insulin', '543.0000']
See the regex demo and the Python demo online.
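The (?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_]) regex splits a string at every position between a letter and a digit or between a digit and a letter. Here, [^\W\d_] matches any word character that is neither a digit nor an underscore, i.e. a letter, so underscores, dots and hyphens never trigger a split. The pattern only matches empty positions, and re.split splits on empty matches only in Python 3.7 and later.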
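To make the two alternatives easier to read, here is a minimal sketch of the same pattern compiled with re.VERBOSE and wrapped in a helper. The names BOUNDARY and split_letters_digits are mine, not part of the question, and the maxsplit argument is an optional extra for inputs that contain more than one letter/digit boundary.

```python
import re

# Same pattern as above, written in verbose mode so each alternative can be commented.
BOUNDARY = re.compile(
    r"""
    (?<=[^\W\d_])(?=\d)    # position preceded by a letter and followed by a digit
    |
    (?<=\d)(?=[^\W\d_])    # position preceded by a digit and followed by a letter
    """,
    re.VERBOSE,
)

def split_letters_digits(s, maxsplit=0):
    """Split s at letter/digit boundaries (hypothetical helper, needs Python 3.7+)."""
    return BOUNDARY.split(s, maxsplit=maxsplit)

print(split_letters_digits('Participants20-30'))
print(split_letters_digits('abc12def'))              # splits at both boundaries
print(split_letters_digits('abc12def', maxsplit=1))  # stops after the first boundary
```

Since the pattern splits at every boundary, a value like 'abc12def' yields three parts; passing maxsplit=1 is one way to keep the result to two parts if that is what you need.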
Pandas test:
>>> import pandas as pd
>>> df = pd.DataFrame({'text':['plasma_glucose_concentration183.0000','Participants20-30','2_hour_serum_insulin543.0000']})
>>> df['text'].str.split(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])')
0 [plasma_glucose_concentration, 183.0000]
1 [Participants, 20-30]
2 [2_hour_serum_insulin, 543.0000]
Name: text, dtype: object
Upvotes: 1