Michael H

Reputation: 131

Regex strategy works in one case but not another

Given this data on Python 3.7.3 and Pandas 0.25.0

import pandas as pd
test = {'data':['1/2 lorem ipsum','2/3 ipsum lorem 4/5','6/7 lorem ipsum',
'8.2/9 ipsum lorem 10.12/13']}
df = pd.DataFrame(test)

I want to separately extract numerators and denominators, and only ever consider the last fraction, so either the only one given or the second if there are two. There are never more than two and between them is some text.

I got this to work for the denominators, extracting the digits after the forward slash:

print(df.data.str.extract('(?:.*\/(\d+)){0}.*\/(\d+)')[1])
0    2 
1    5 
2    7 
3    13
Name: 1, dtype: object

I haven't been able to get this to work for the numerators, possibly because they can include decimals. The closest I get is the following, which mirrors the code above and adds handling for an optional decimal part:

df.data.str.extract('(?:((?:\d+\.)?\d+)\/){0}(?:((?:\d+\.)?\d+)\/)')[1]
0      1
1      2
2      6
3    8.2
Name: 1, dtype: object

It picks up the decimal correctly but returns the numerator of the first fraction in each string rather than the last. The expected values are 1, 4, 6, 10.12.

After trying countless variations of the code I'm stuck and hope the error can be found.

Upvotes: 0

Views: 110

Answers (3)

Chris Doyle

Reputation: 12145

A more pandas-style approach with a simple regex for fractions: extract every fraction with extractall, then keep only the last match per row.

import pandas as pd
test = {'data':[
    '1/2 lorem ipsum',
    '2/3 ipsum lorem 4/5',
    '6/7 lorem ipsum',
    '8.2/9 ipsum lorem 10.12/13']}
df = pd.DataFrame(test)
# Raw string avoids invalid-escape warnings; tail(1) keeps the last match per row
fractions = df.data.str.extractall(r'(\d+\.?\d*)/(\d+)').groupby(level=0).tail(1)
numerators = fractions[0].tolist()
denominators = fractions[1].tolist()
print("Numerators:",numerators,"\nDenominators",denominators)

OUTPUT

Numerators: ['1', '4', '6', '10.12'] 
Denominators ['2', '5', '7', '13']
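
The same idea (collect every fraction, then keep the last one per string) can be sketched without pandas, using only the standard re module. This is a minimal illustration rather than a replacement for the code above; the helper name last_fraction is made up for the example.

```python
import re

# Same pattern as in the extractall call above
FRACTION = re.compile(r'(\d+\.?\d*)/(\d+)')

def last_fraction(text):
    """Return (numerator, denominator) of the last fraction in text, or None."""
    matches = FRACTION.findall(text)  # list of (numerator, denominator) tuples
    return matches[-1] if matches else None

data = [
    '1/2 lorem ipsum',
    '2/3 ipsum lorem 4/5',
    '6/7 lorem ipsum',
    '8.2/9 ipsum lorem 10.12/13',
]
pairs = [last_fraction(s) for s in data]
numerators = [p[0] for p in pairs]
denominators = [p[1] for p in pairs]
print(numerators)    # ['1', '4', '6', '10.12']
print(denominators)  # ['2', '5', '7', '13']
```

Strings without any fraction yield None here, so the list comprehensions would need a guard if such rows can occur.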

Upvotes: 1

Wiktor Stribiżew

Reputation: 627101

You may use

>>> df.data.str.extract(r'(?:.*\D)?(?<!\d\.)(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)')
       0   1
0      1   2
1      4   5
2      6   7
3  10.12  13

See the regex demo.

Details

  • (?:.*\D)? - an optional sequence of any 0+ chars other than line break chars, as many as possible, followed by a non-digit char
  • (?<!\d\.) - not immediately preceded by a digit and a dot
  • (\d+(?:\.\d+)?) - Capturing group 1: 1+ digits and an optional sequence of . and 1+ digits
  • / - a /
  • (\d+(?:\.\d+)?) - Capturing group 2: 1+ digits and an optional sequence of . and 1+ digits.
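
To see why the lookbehind matters, here is a small sketch with the standard re module comparing the pattern with and without it. The greedy prefix pushes the match towards the end of the string, so without the lookbehind the numerator can start after the decimal point. The names WITH_LOOKBEHIND and WITHOUT_LOOKBEHIND are only for this illustration.

```python
import re

WITH_LOOKBEHIND = re.compile(r'(?:.*\D)?(?<!\d\.)(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)')
# Same pattern with the (?<!\d\.) check removed
WITHOUT_LOOKBEHIND = re.compile(r'(?:.*\D)?(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)')

text = '8.2/9 ipsum lorem 10.12/13'

with_lb = WITH_LOOKBEHIND.search(text).groups()
without_lb = WITHOUT_LOOKBEHIND.search(text).groups()

print(with_lb)     # ('10.12', '13')
print(without_lb)  # ('12', '13'): the prefix swallowed "10."
```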

If you need the numerator and the denominator of the last fraction separately, keep only the capturing group you need:

>>> df.data.str.extract(r'(?:.*\D)?(?<!\d\.)(\d+(?:\.\d+)?)/\d+(?:\.\d+)?')
       0
0      1
1      4
2      6
3  10.12
>>> df.data.str.extract(r'(?:.*\D)?(?<!\d\.)\d+(?:\.\d+)?/(\d+(?:\.\d+)?)')
    0
0   2
1   5
2   7
3  13

Upvotes: 0

Tim Pietzcker

Reputation: 336408

I'd suggest the following regex:

(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)(?!.*\d+(?:\.\d+)?/\d+(?:\.\d+)?)

This will match a fraction as long as it's not followed by another fraction within the same string.

Test it live on regex101.com.

Explanation:

(\d+(?:\.\d+)?) # Match a number, optionally followed by a decimal part
/               # Match a slash
(\d+(?:\.\d+)?) # Match another number
(?!             # only if it's not possible to match...
 .*             # any string
 \d+(?:\.\d+)?  # followed by a number,
 /              # a slash
 \d+(?:\.\d+)?  # and another number.
)               # (End of lookahead assertion)
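
As a quick check, the commented pattern can be compiled with re.VERBOSE and run against the sample strings from the question. This is only a sketch using the standard re module; the name LAST_FRACTION is chosen for the example.

```python
import re

LAST_FRACTION = re.compile(r"""
    (\d+(?:\.\d+)?)   # numerator, optionally with a decimal part
    /                 # a slash
    (\d+(?:\.\d+)?)   # denominator
    (?!               # only if no further fraction follows
        .*\d+(?:\.\d+)?/\d+(?:\.\d+)?
    )
""", re.VERBOSE)

samples = [
    '1/2 lorem ipsum',
    '2/3 ipsum lorem 4/5',
    '6/7 lorem ipsum',
    '8.2/9 ipsum lorem 10.12/13',
]
results = [LAST_FRACTION.search(s).groups() for s in samples]
print(results)
# [('1', '2'), ('4', '5'), ('6', '7'), ('10.12', '13')]
```

The first group holds the numerator and the second the denominator, so they can be extracted separately.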

Upvotes: 4
