Contrapunto

Reputation: 172

RegEx for extracting a decimal number

I have a pandas DataFrame in which one column contains text with ratings in the format X/10. I want to extract the numerators (which can be decimals). So far I have been using:

my_df.text_column.str.extract(r'(\d*?\.?\d+(?=/10))')

I thought this was working until I saw that I had some numerators like .10. It turns out that some rows contain text like "Nice job.10/10".

How can I specify that, when extracting a number from this column, a "." is only included if it comes after a digit?

Thanks.

Upvotes: 2

Views: 1655

Answers (3)

user557597

The simplest way is (\d+(?:\.\d*)?(?=/10))


Sample

For "Nice job.10/10" the match is "10". If the text were "0.10/10", then it would be "0.10".
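
As a quick check, here is a minimal sketch that applies the pattern above to the two sample strings. It uses Python's built-in re module rather than pandas, and the helper name extract_numerator is made up for illustration.

```python
import re

# Pattern from this answer: one or more digits, optionally followed by a dot
# and further digits, matched only when directly followed by "/10".
NUMERATOR = re.compile(r'(\d+(?:\.\d*)?(?=/10))')

def extract_numerator(text):
    """Return the first X found in 'X/10' as a string, or None if absent."""
    match = NUMERATOR.search(text)
    return match.group(1) if match else None

print(extract_numerator("Nice job.10/10"))  # 10
print(extract_numerator("0.10/10"))         # 0.10
```

Because the pattern has to start with a digit, the "." that ends the sentence in "Nice job.10/10" can never become part of the match.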

Benchmark

Regex1:   (\d+(?:\.\d*)?(?=/10))
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   2
Elapsed Time:    1.04 s,   1038.38 ms,   1038383 µs
Matches per sec:   96,303

Upvotes: 1

TimWeri

Reputation: 25

I would separate the numerator pattern into two cases: one with "." and one without ".".

  • Numerator with ".": \d+\.\d+
  • Numerator without ".": \d+

Thus, the pattern for the numerator would be (\d+\.\d+|\d+). Putting everything together, we have (\d+\.\d+|\d+)/\d+.

The order of the two alternatives matters if the regex engine does not prioritize longer matches. Putting the longer alternative first gives you the longest match: if the numerator is a decimal number, the whole numerator (\d+\.\d+) is chosen instead of just its decimal part (\d+).
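
A minimal sketch of this alternation, again with Python's built-in re module; the function name numerator_of is hypothetical and the sample strings are only illustrative.

```python
import re

# Longer alternative first: a decimal numerator (digits.digits) is tried
# before a plain integer numerator; the "/\d+" part matches the denominator.
RATING = re.compile(r'(\d+\.\d+|\d+)/\d+')

def numerator_of(text):
    """Return the captured numerator as a string, or None if no rating is found."""
    match = RATING.search(text)
    return match.group(1) if match else None

print(numerator_of("Nice job.10/10"))  # 10
print(numerator_of("Score 9.5/10"))    # 9.5
print(numerator_of("Rated 7/10"))      # 7
```

Note that this pattern accepts any denominator, not just 10, and that the numerator is read from the first capture group rather than from the whole match.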

Upvotes: 1

R. Schifini

Reputation: 9313

Do:

df.text.str.extract(r'(\d+\.?\d*?(?=/10))')

You want to first look for a number (\d+), followed by an optional dot (\.?) and optional decimal digits (\d*?).

Example:

import pandas as pd

df = pd.DataFrame({'text':["Nice Job.10/10", "Score 9.5/10", "And now 5./10"]})
df.text.str.extract(r'(\d+\.?\d*?(?=/10))')



    0
0   10
1   9.5
2   5.

Upvotes: 2
