Reputation: 4635
I have a dataframe in which one of the columns contains strings such as 'weak=30'. I want to extract the digits after the = and create a new column named digits.
I use re.search to find the digits, but so far it is giving an error.
Example data
import pandas as pd
import re
raw_data = {'patient': [1, 2, 3, 4, 6],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong=42', 'weak=30', 'weak=12', 'pitt=12', 'strong=42']}
df = pd.DataFrame(raw_data, columns=['patient', 'treatment', 'score'])
df
patient treatment score
0 1 0 strong=42
1 2 1 weak=30
2 3 0 weak=12
3 4 1 pitt=12
4 6 0 strong=42
So I tried
df=df.assign(digits=[int(re.search(r'\d+', x)) for x in df.score])
TypeError: int() argument must be a string, a bytes-like object or a number, not 're.Match'
In R you can simply do
mutate(digits=as.numeric(gsub(".*=","",score)))
What would be the equivalent function in Python pandas?
Expected output
patient treatment score digits
0 1 0 strong=42 42
1 2 1 weak=30 30
2 3 0 weak=12 12
3 4 1 pitt=12 12
4 6 0 strong=42 42
Upvotes: 1
Views: 2360
Reputation: 626728
You can simply use str.replace with the regex from your R code:
df['digits'] = df['score'].str.replace(r'.*=', '', regex=True).astype(int)
The .*= pattern matches zero or more characters other than line break characters, as many as possible, up to and including the last =, and replacing the match with '' removes this text.
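To see the greedy behaviour in isolation, here is a minimal sketch using plain re.sub on single strings. The input 'a=b=42' is a made-up value (not from the question's data) chosen only to show that everything up to the last = is removed.

```python
import re

# Greedy .* consumes as much as possible, so the match ends at the LAST '='.
single = re.sub(r'.*=', '', 'strong=42')
double = re.sub(r'.*=', '', 'a=b=42')  # hypothetical input with two '=' signs

print(single)  # 42
print(double)  # 42
```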
Or, you may extract the digits after = at the end of the string:
df['digits'] = df['score'].str.extract(r'=(\d+)$', expand=False).astype(int)
Here, =(\d+)$ matches =, then captures one or more digits into Group 1, and then asserts the position at the end of the string.
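One thing to watch with the extract approach: rows that do not match yield a missing value, and .astype(int) will then raise. The following is a sketch of one possible way to handle that, assuming a nullable integer column is acceptable. The 'unknown' row is invented for illustration and is not in the question's data.

```python
import pandas as pd

# 'unknown' is a hypothetical row with no '=<digits>' suffix.
scores = pd.Series(['strong=42', 'weak=30', 'unknown'])

# Non-matching rows come back as missing values.
extracted = scores.str.extract(r'=(\d+)$', expand=False)

# Convert to numbers, then to pandas' nullable integer dtype so the
# missing value is kept instead of raising an error.
digits = pd.to_numeric(extracted).astype('Int64')
print(digits)
```

If every row is guaranteed to match, the plain .astype(int) shown above is enough.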
Output in both cases is:
>>> df
patient treatment score digits
0 1 0 strong=42 42
1 2 1 weak=30 30
2 3 0 weak=12 12
3 4 1 pitt=12 12
4 6 0 strong=42 42
Upvotes: 2
Reputation: 99
re.search returns a match object, not the matched string itself, which is why int() raises the TypeError. See https://docs.python.org/3.7/library/re.html#match-objects
If you want the string you could try something along the lines of:
re.search(r'\d+', x).group(0)
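Applied to the list comprehension from the question, that could look like the sketch below. The helper name first_int is made up for illustration, and the None fallback is an assumption about how you would want to handle strings with no digits (re.search returns None when nothing matches).

```python
import re
import pandas as pd

df = pd.DataFrame({'score': ['strong=42', 'weak=30', 'weak=12', 'pitt=12', 'strong=42']})

def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group(0)) if match else None

df = df.assign(digits=[first_int(x) for x in df.score])
print(df)
```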
Upvotes: 0