Reputation: 1823

regex | extract numbers preceded by defined strings

I have strings like:

Bla bla 0.75 oz. Bottle
Mugs, 8oz. White
Bowls, 4.4" dia x 2.5", 12ml. Natural
Ala bala 3.3" 30ml Bottle

I want to extract the numeric value that occurs immediately before one of my predefined units, in this case oz or ml:

0.75 oz
8 oz
12 ml
30 ml

I have the below code:

import re
import pandas as pd
look_ahead = "oz|ml"

s = pd.Series(['Bla bla 0.75 oz. Bottle',
              'Mugs, 8oz. White',
              'Bowls, 4.4" dia x 2.5", 12ml. Natural',
              'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
        rf"((?!,)[0-9]+.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")

print(size_and_units)

Which outputs this:

0                  [0.75 oz]
1                      [8oz]
2    [4.4" dia x 2.5", 12ml]
3                [3.3" 30ml]

You can see there is a mismatch between the output I want and the output my script produces. I think my regex matches everything between the first numeric value and the lookahead, but I only want the last numeric value before the lookahead.

I am out of my depth with regex. Can someone help me fix this? Thank you!

Upvotes: 0

Answers (3)

The fourth bird

Reputation: 163287

Some notes about the pattern that you tried:

You can omit the negative lookahead (?!,): it always succeeds, because the next thing the pattern matches is a digit, which can never be a comma.
In .*[0-9]* *(?=oz|ml)[a-zA-Z]+ the part .*[0-9]* * is entirely optional, and the unescaped .* first matches to the end of the string. The engine then backtracks until the lookahead can see oz or ml, after which [a-zA-Z]+ matches one or more letters, so the pattern could also match 0.75 ozaaaaaaa
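To make both notes concrete, here is a minimal sketch that runs the question's pattern with plain re. The units are inlined, and the second input string is a made-up example rather than one of the asker's rows.

```python
import re

# The pattern from the question, with the units inlined.
original = re.compile(r"((?!,)[0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+)")

# The unescaped `.*` starts at the first digit and only gives back
# characters until the lookahead can see a unit.
greedy = original.findall('Bowls, 4.4" dia x 2.5", 12ml. Natural')
print(greedy)

# `[a-zA-Z]+` keeps consuming letters after the unit (hypothetical input).
trailing = original.findall("Bla bla 0.75 ozaaaaaaa")
print(trailing)
```

The first call reproduces the over-long match from the question, and the second shows the letters after the unit being swallowed.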

If you want the matches, you don't need a capture group or lookarounds. You can match:

\b\d+(?:\.\d+)*\s*(?:oz|ml)\b

\b A word boundary to prevent a partial word match
\d+(?:\.\d+)* Match 1+ digits followed by zero or more decimal parts (use ? instead of * to allow at most one decimal part)
\s*(?:oz|ml) Match optional whitespace chars and either oz or ml
\b A word boundary

import pandas as pd

look_ahead = "oz|ml"

s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"\b\d+(?:\.\d+)*\s*(?:{look_ahead})\b")

print(size_and_units)

Output

0    [0.75 oz]
1        [8oz]
2       [12ml]
3       [30ml]
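
The desired output in the question has exactly one space between the number and the unit (8 oz, 12 ml). One possible way to get that, sketched here as an assumption rather than part of the original answer, is to capture the number and the unit separately with str.extract and join them. The group names value and unit are made up for this example, and the decimal part uses ? so that at most one is allowed. Note that str.extract keeps only the first match per row.

```python
import pandas as pd

units = "oz|ml"  # same alternation as look_ahead above

s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])

# Named groups split the first match in each row into a number and a unit.
parts = s.str.extract(rf"\b(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>{units})\b")

# Join the two columns with a single space to normalize the spacing.
normalized = parts["value"] + " " + parts["unit"]
print(normalized.tolist())
```

If a row can contain several sizes, str.extractall with the same pattern would be the closer equivalent of findall.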

Upvotes: 1

Egemen Çiftci

Reputation: 744

I think this regex will work for you:

[0-9]+\.*[0-9]* *(oz|ml)
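
One caveat when using this pattern with s.str.findall, which behaves like re.findall: if the pattern contains a capturing group, only the group's text is returned, not the whole match. A minimal sketch of the difference, using two of the question's strings:

```python
import pandas as pd

s = pd.Series(['Bla bla 0.75 oz. Bottle', 'Mugs, 8oz. White'])

# With a capturing group, findall returns only what the group matched.
capturing = s.str.findall(r"[0-9]+\.*[0-9]* *(oz|ml)")
print(capturing.tolist())

# With a non-capturing group (?:...), findall returns the whole match.
non_capturing = s.str.findall(r"[0-9]+\.*[0-9]* *(?:oz|ml)")
print(non_capturing.tolist())
```

So for findall, the units are better wrapped in (?:oz|ml).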

Upvotes: 0

pho

Reputation: 25489

To change your regex as little as possible, so you can see what went wrong: in [0-9]+.*[0-9]*, replace . with \. An unescaped . matches any character, while \. matches a literal period.

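import pandas as pd

look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
              'Mugs, 8oz. White',
              'Bowls, 4.4" dia x 2.5", 12ml. Natural',
              'Ala bala 3.3" 30ml Bottle'])
# Only change from the question: the period is escaped (\.* instead of .*)
size_and_units = s.str.findall(
        rf"((?!,)[0-9]+\.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
print(size_and_units)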

gives:

0    [0.75 oz]
1        [8oz]
2       [12ml]
3       [30ml]

You don't need a lookahead at all, though, since you also want to match the units. Just use:

\d+\.*\d*\s*(?:oz|ml)

This gives the same result:

size_and_units = s.str.findall(
        rf"\d+\.*\d*\s*(?:{look_ahead})")

Upvotes: 1
