BhishanPoudel

Reputation: 17144

Multiple regex replacements with pandas

I have a pandas series of various age ranges:

s = pd.Series([14,1524,2534,3544,65])

I would like to create a new series like this:

0     0-14
1    15-24
2    25-34
3    35-44
4      65+

I can do this using mapping:

import pandas as pd

s = pd.Series([14,1524,2534,3544,65])
age_map = {
    14: '0-14',
    1524: '15-24',
    2534: '25-34',
    3544: '35-44',
    4554: '45-54',
    5564: '55-64',
    65: '65+'
}
s.map(age_map)

I can also do it with several regex replacements in sequence:

s = pd.Series([14,1524,2534,3544,65])
s = s.astype(str).str.replace(r'(\d\d)(\d\d)', r'\1-\2', regex=True)
s = s.str.replace(r'14', r'0-14', regex=True)
s = s.str.replace(r'65', r'65+', regex=True)
s
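
One caveat with the sequential approach: the patterns are unanchored, so `14` and `65` also match inside longer values. A minimal sketch of the difference, assuming the series could contain values such as 1424 or 6574 (both made up for illustration, they are not in the original data):

```python
import pandas as pd

# 1424 and 6574 are hypothetical values added to show the pitfall
s = pd.Series([14, 1424, 1524, 65, 6574]).astype(str)

# Unanchored patterns: '14' and '65' also match inside '14-24' and '65-74'
unanchored = (
    s.str.replace(r'(\d\d)(\d\d)', r'\1-\2', regex=True)
     .str.replace(r'14', r'0-14', regex=True)
     .str.replace(r'65', r'65+', regex=True)
)

# Anchored patterns: each rule only applies to the whole string
anchored = (
    s.str.replace(r'^(\d\d)(\d\d)$', r'\1-\2', regex=True)
     .str.replace(r'^14$', r'0-14', regex=True)
     .str.replace(r'^65$', r'65+', regex=True)
)

print(unanchored.tolist())  # ['0-14', '0-14-24', '15-24', '65+', '65+-74']
print(anchored.tolist())    # ['0-14', '14-24', '15-24', '65+', '65-74']
```

With the five values from the question both versions give the same result, so the anchors only matter if other ranges can show up.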

Question
Can all three replacements be combined into a single regex that produces the same result?

Something like:

s = pd.Series([14,1524,2534,3544,65])
pat = ''
pat_sub = ''
s = s.astype(str).str.replace(pat, pat_sub,regex=True)
s

Upvotes: 1

Views: 318

Answers (2)

BhishanPoudel

Reputation: 17144

I like @coldspeed's answer: it is more flexible, and the function is reusable.

However, I also came up with a chained pandas version:

s = (s.astype(str)
      .str.replace(r'14', r'0-14', regex=True)
      .str.replace(r'65', r'65+', regex=True)
      .str.replace(r'(\d\d)(\d\d)', r'\1-\2', regex=True))

s

Upvotes: 1

cs95

Reputation: 402463

You can use a single callback function to handle all the cases:

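def parse_str(match):
    # a is the first two digits; b is the optional second pair (None if absent)
    a, b = match.groups()
    if not b:
        # two-digit input: 14 is the lowest bracket, anything else is open-ended
        return f'0-{a}' if a == '14' else f'{a}+'
    return f'{a}-{b}'

# a callable replacement requires regex=True (the default is False since pandas 2.0)
s.astype(str).str.replace(r'(\d{2})(\d{2})?', parse_str, regex=True)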

0     0-14
1    15-24
2    25-34
3    35-44
4      65+
dtype: object

This works as long as your Series contains only two- or four-digit numbers.
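
If that assumption might not hold, one option is to anchor the pattern so that anything other than exactly two or four digits is left untouched. This is only a sketch: `parse_age` is a hypothetical name, and the values 7 and 123 are made up to show the behaviour.

```python
import pandas as pd

def parse_age(match):
    # low is always present; high is None for two-digit input
    low, high = match.groups()
    if high is None:
        return f'0-{low}' if low == '14' else f'{low}+'
    return f'{low}-{high}'

# 7 and 123 are hypothetical values that break the two-or-four-digit assumption
s = pd.Series([14, 1524, 2534, 3544, 65, 7, 123])

# ^ and $ force the pattern to cover the whole string, so '7' and '123' do not match
result = s.astype(str).str.replace(r'^(\d{2})(\d{2})?$', parse_age, regex=True)

print(result.tolist())
# ['0-14', '15-24', '25-34', '35-44', '65+', '7', '123']
```

Without the anchors, `123` would be partially matched on its first two digits and come out as `12+3`.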

Upvotes: 3
