Alexander

Reputation: 4635

Extracting digits after certain characters in pandas dataframe

I have a dataframe in which one column contains strings such as 'weak=30'. I want to extract the digits after the = sign and put them in a new column named digits.

I used re.search to find the digits, but so far it raises an error.

Example data

import pandas as pd
import re

raw_data = {'patient': [1, 2, 3, 4, 6],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong=42', 'weak=30', 'weak=12', 'pitt=12', 'strong=42']}

df = pd.DataFrame(raw_data, columns=['patient', 'treatment', 'score'])

df

   patient  treatment      score
0        1          0  strong=42
1        2          1    weak=30
2        3          0    weak=12
3        4          1    pitt=12
4        6          0  strong=42

So I tried

df=df.assign(digits=[int(re.search(r'\d+', x)) for x in df.score])

TypeError: int() argument must be a string, a bytes-like object or a number, not 're.Match'

In R you can simply do

mutate(digits=as.numeric(gsub(".*=","",score)))

What would be the equivalent in Python pandas?

Expected output

   patient  treatment      score   digits
0        1          0  strong=42     42
1        2          1    weak=30     30
2        3          0    weak=12     12
3        4          1    pitt=12     12
4        6          0  strong=42     42

Upvotes: 1

Views: 2360

Answers (2)

Wiktor Stribiżew

Reputation: 626728

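You can use str.replace with the regex from your R code (pass regex=True explicitly, because since pandas 2.0 the pattern is treated as a literal string by default):

df['digits'] = df['score'].str.replace(r'.*=', '', regex=True).astype(int)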

The .*= pattern matches zero or more characters other than line break characters, as many as possible, up to and including the last =; replacing the match with '' removes that text.
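
To see the greedy behaviour of .*= in isolation, here is a minimal sketch with plain re.sub. The sample string 'a=b=42' is made up for illustration and does not occur in the question's data; the lazy variant .*?= is shown only for contrast.

```python
import re

# Hypothetical sample with two '=' signs, to show where the match ends
sample = 'a=b=42'

# Greedy '.*=' consumes everything up to and including the LAST '='
greedy = re.sub(r'.*=', '', sample)

# Lazy '.*?=' stops at the FIRST '='; count=1 limits it to one replacement
lazy = re.sub(r'.*?=', '', sample, count=1)

print(greedy)  # 42
print(lazy)    # b=42
```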

Alternatively, you can extract the digits that follow = at the end of the string:

df['digits'] = df['score'].str.extract(r'=(\d+)$', expand=False).astype(int)

Here, =(\d+)$ matches =, then captures one or more digits into Group 1, and then asserts the position at the end of the string. With expand=False, extract returns a Series rather than a one-column DataFrame.

Output in both cases is:

>>> df
   patient  treatment      score  digits
0        1          0  strong=42      42
1        2          1    weak=30      30
2        3          0    weak=12      12
3        4          1    pitt=12      12
4        6          0  strong=42      42
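
For convenience, both approaches can be run end to end on the question's data. This is only a sketch: the variable names via_replace and via_extract are made up here, and regex=True is spelled out on the assumption that a recent pandas version is installed.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'patient': [1, 2, 3, 4, 6],
    'treatment': [0, 1, 0, 1, 0],
    'score': ['strong=42', 'weak=30', 'weak=12', 'pitt=12', 'strong=42'],
})

# Approach 1: delete everything up to and including the last '='
via_replace = df['score'].str.replace(r'.*=', '', regex=True).astype(int)

# Approach 2: capture the digits that follow '=' at the end of the string
via_extract = df['score'].str.extract(r'=(\d+)$', expand=False).astype(int)

df['digits'] = via_extract
print(via_replace.tolist() == via_extract.tolist())  # True
print(df['digits'].tolist())  # [42, 30, 12, 12, 42]
```

Note that astype(int) would fail in the second approach if some row had no match, because extract returns NaN for such rows.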

Upvotes: 2

B Man

Reputation: 99

re.search returns a Match object (or None when nothing matches), not the matched string itself. See https://docs.python.org/3.7/library/re.html#match-objects

If you want the matched string, you could try something along the lines of:

re.search(r'\d+', x).group(0)
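
Applied to the list comprehension from the question, that could look like the sketch below. It assumes every value in score contains at least one digit; otherwise re.search returns None and the .group(0) call raises AttributeError.

```python
import re
import pandas as pd

# Only the 'score' column from the question is needed for this sketch
df = pd.DataFrame({'score': ['strong=42', 'weak=30', 'weak=12', 'pitt=12', 'strong=42']})

# .group(0) returns the matched text as a str, which int() can convert
df = df.assign(digits=[int(re.search(r'\d+', x).group(0)) for x in df.score])

print(df['digits'].tolist())  # [42, 30, 12, 12, 42]
```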

Upvotes: 0
