madmax80

Reputation: 181

How to split messy string to letters and numbers using Regex in Python

I have these two strings:

x = 'plasma_glucose_concentration183.0000'
y = 'Participants20-30'

And I want to split them as follows:

x: ['plasma_glucose_concentration', '183.0000']
y: ['Participants', '20-30']

I created this function, but only the first string is split correctly:

import re

def split_string(x):
    res = re.findall(r"(\w+?)(\d*\.\d+|\d+)", x)
    return res

When I split the second string I get:

  [('Participants', '20'), ('3', '0')]

Is there a regex solution for this? Thanks.

Upvotes: 2

Views: 94

Answers (1)

Wiktor Stribiżew

Reputation: 627103

You can use re.split with a pattern built from lookarounds:

import re

x = ['plasma_glucose_concentration183.0000', 'Participants20-30','2_hour_serum_insulin543.0000']
for s in x:
    print(re.split(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])', s))

# => ['plasma_glucose_concentration', '183.0000']
#    ['Participants', '20-30']
#    ['2_hour_serum_insulin', '543.0000']

See the regex demo and the Python demo online.

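The (?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_]) regex matches the empty position between a letter and a digit, or between a digit and a letter, and the string is split at each such position. Here [^\W\d_] matches any word character that is not a digit or an underscore, i.e. any letter, so underscores, dots and hyphens never trigger a split. Note that re.split only splits on empty matches in Python 3.7 and later.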
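To make the two alternatives easier to read, here is a minimal sketch that compiles the same pattern in verbose mode and wraps it in a helper. The names LETTER_DIGIT_BOUNDARY and split_letters_digits are made up for this illustration and are not part of the re module.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so each part can carry a comment.
# [^\W\d_] reads as "a word character that is neither a digit nor an underscore", i.e. a letter.
LETTER_DIGIT_BOUNDARY = re.compile(r"""
    (?<=[^\W\d_])(?=\d)      # position preceded by a letter and followed by a digit
    |
    (?<=\d)(?=[^\W\d_])      # position preceded by a digit and followed by a letter
""", re.VERBOSE)


def split_letters_digits(s):
    """Split s at every letter/digit boundary (hypothetical helper name)."""
    return LETTER_DIGIT_BOUNDARY.split(s)


for s in ['plasma_glucose_concentration183.0000', 'Participants20-30']:
    print(split_letters_digits(s))
# => ['plasma_glucose_concentration', '183.0000']
#    ['Participants', '20-30']
```

Keep in mind that the split happens at every such boundary, so a string like 'abc12def' yields three pieces, ['abc', '12', 'def']. If only the first boundary should count, passing maxsplit=1 to split is one option.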

Pandas test:

>>> import pandas as pd
>>> df = pd.DataFrame({'text':['plasma_glucose_concentration183.0000','Participants20-30','2_hour_serum_insulin543.0000']})
>>> df['text'].str.split(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])')
0    [plasma_glucose_concentration, 183.0000]
1                       [Participants, 20-30]
2            [2_hour_serum_insulin, 543.0000]
Name: text, dtype: object

Upvotes: 1
