Reputation: 172
I have a pandas DataFrame in which one column holds text containing ratings in the format X/10. I want to extract the numerators (which can be decimals). So far I have been using:
my_df.text_column.str.extract('(\d*?\.?\d+(?=/10))')
I thought I was doing fine until I saw that I had some numerators like .10. What is actually happening is some rows have text like: "Nice job.10/10".
How can I specify that, when extracting a number from this column, any "." it picks up must come after a digit?
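The behaviour described above can be reproduced with a minimal sketch using the standard `re` module instead of pandas. It assumes `str.extract` returns the first capture group of the first match, as `re.search` does, and the names `original`, `samples`, and `results` are illustrative only.

```python
import re

# Pattern from the question: the leading digits are optional (\d*?),
# so a match is allowed to start at a bare "." that follows a word.
original = re.compile(r'(\d*?\.?\d+(?=/10))')

# Hypothetical sample rows; only the first one comes from the question.
samples = ["Nice job.10/10", "Score 9.5/10", "Solid 7/10"]

results = [original.search(s).group(1) for s in samples]
print(results)  # ['.10', '9.5', '7']
```

The first row yields `.10` because the search is leftmost: the sentence-ending period is reached before the `1`, and nothing in the pattern requires a digit in front of it.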
Thanks.
Upvotes: 2
Views: 1655
Reputation:
The simplest way is (\d+(?:\.\d*)?(?=/10))
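A quick way to try this pattern is sketched below with the standard `re` module, assuming the same pattern string would behave identically when passed to `str.extract`. The variable names and the third sample row are illustrative assumptions.

```python
import re

# One or more digits are required first; the ".digits" part is an optional group,
# so a "." can only be consumed after at least one digit.
pattern = re.compile(r'(\d+(?:\.\d*)?(?=/10))')

samples = ["Nice job.10/10", "0.10/10", "Score 9.5/10"]
extracted = [pattern.search(s).group(1) for s in samples]
print(extracted)  # ['10', '0.10', '9.5']
```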
Sample
For "Nice job.10/10" the match is "10".
If the text were "0.10/10", then it would be "0.10".
Benchmark
Regex1: (\d+(?:\.\d*)?(?=/10))
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 1.04 s, 1038.38 ms, 1038383 µs
Matches per sec: 96,303
Upvotes: 1
Reputation: 25
I would separate the numerator pattern into two cases: one with "." and one without ".".
\d+\.\d+
\d+
Thus, the pattern for the numerator would be (\d+\.\d+|\d+).
Putting everything together, we have (\d+\.\d+|\d+)/\d+.
The order of the two parts matters if the regexp engine does not prioritize longer matches. Putting the longer alternative first will give you the longest match: if the numerator is a decimal number, the whole numerator (\d+\.\d+) is chosen instead of just part of it (\d+).
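The effect of alternative order can be checked with a small sketch. It uses Python's `re`, which takes the first alternative that succeeds rather than the longest one. All variable names and the sample strings other than "Nice job.10/10" are illustrative assumptions.

```python
import re

# Without anything after the group, the order of the alternatives decides the match.
short_first = re.search(r'(\d+|\d+\.\d+)', "9.5").group(1)
long_first = re.search(r'(\d+\.\d+|\d+)', "9.5").group(1)
print(short_first, long_first)  # 9 9.5

# Full pattern from this answer: numerator, then "/" and the denominator.
full = re.compile(r'(\d+\.\d+|\d+)/\d+')
samples = ["Nice job.10/10", "Score 9.5/10", "Solid 7/10"]
numerators = [full.search(s).group(1) for s in samples]
print(numerators)  # ['10', '9.5', '7']
```

Both alternatives start with `\d+`, so a match can never begin at a stray ".".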
Upvotes: 1
Reputation: 9313
Do:
df.text.str.extract(r'(\d+\.?\d*?(?=/10))')
You want to first look for a number (\d+), followed by an optional dot (\.?) and optional decimal digits (\d*?).
Example:
import pandas as pd
df = pd.DataFrame({'text': ["Nice Job.10/10", "Score 9.5/10", "And now 5./10"]})
df.text.str.extract(r'(\d+\.?\d*?(?=/10))')
     0
0   10
1  9.5
2   5.
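Note that the last row keeps its trailing dot ("5."). If the extracted strings are later turned into numbers, that is harmless, as the sketch below suggests. It uses plain `re` and `float` rather than pandas, and the names `texts`, `raw`, and `values` are illustrative only.

```python
import re

pattern = re.compile(r'(\d+\.?\d*?(?=/10))')
texts = ["Nice Job.10/10", "Score 9.5/10", "And now 5./10"]

# Same matches as in the example output above.
raw = [pattern.search(t).group(1) for t in texts]

# float() accepts a trailing dot, so "5." converts cleanly.
values = [float(r) for r in raw]
print(raw)     # ['10', '9.5', '5.']
print(values)  # [10.0, 9.5, 5.0]
```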
Upvotes: 2