Reputation: 1823
I have strings like:
Bla bla 0.75 oz. Bottle
Mugs, 8oz. White
Bowls, 4.4" dia x 2.5", 12ml. Natural
Ala bala 3.3" 30ml Bottle
I want to extract the numeric value that occurs immediately before one of my predefined lookahead units, in this case [oz, ml]:
0.75 oz
8 oz
12 ml
30 ml
I have the below code:
import re
import pandas as pd
look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
'Mugs, 8oz. White',
'Bowls, 4.4" dia x 2.5", 12ml. Natural',
'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
rf"((?!,)[0-9]+.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
print(size_and_units)
Which outputs this:
0 [0.75 oz]
1 [8oz]
2 [4.4" dia x 2.5", 12ml]
3 [3.3" 30ml]
You can see there is a mismatch between the output I want and what my script produces. I think my regex is picking up everything between the first numeric value and my defined lookahead, but I only want the last numeric value before the lookahead.
I am out of my depth with regex. Can someone help me fix this? Thank you!
Upvotes: 0
Views: 848
Reputation: 163287
Some notes about the pattern that you tried:
You can omit (?!,) as it is always true, because the next thing the pattern matches is a digit.
In the part .*[0-9]* *(?=oz|ml)[a-zA-Z]+) everything in .*[0-9]* * is optional, and .* will first match until the end of the string. It then backtracks until it can match either oz or ml, and [a-zA-Z]+ matches 1 or more letters, so it could also match 0.75 ozaaaaaaa
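The backtracking described above can be reproduced with the standard re module (pandas' str.findall applies re.findall to each element). This is only a minimal sketch, with the lookahead units written out as oz|ml rather than interpolated:

```python
import re

# The pattern from the question, with the look-ahead units hard-coded.
pattern = r'((?!,)[0-9]+.*[0-9]* *(?=oz|ml)[a-zA-Z]+)'

# The greedy .* runs to the end of the string, then backtracks to the
# last position that is followed by "oz" or "ml".
greedy = re.findall(pattern, 'Ala bala 3.3" 30ml Bottle')
print(greedy)    # ['3.3" 30ml']

# [a-zA-Z]+ keeps consuming letters after the unit.
trailing = re.findall(pattern, '0.75 ozaaaaaaa')
print(trailing)  # ['0.75 ozaaaaaaa']
```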
If you want the matches, you don't need a capture group or lookarounds. You can match:
\b\d+(?:\.\d+)*\s*(?:oz|ml)\b
\b : A word boundary to prevent a partial word match
\d+(?:\.\d+)* : Match 1+ digits with an optional decimal part (the * also allows the decimal part to repeat)
\s*(?:oz|ml) : Match optional whitespace chars and either oz or ml
\b : A word boundary
import pandas as pd
look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
'Mugs, 8oz. White',
'Bowls, 4.4" dia x 2.5", 12ml. Natural',
'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
rf"\b\d+(?:\.\d+)*\s*(?:{look_ahead})\b")
print(size_and_units)
Output
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]
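To see what the two \b anchors buy you, here is a small comparison using the standard re module. The test string is made up for illustration and is not from the question:

```python
import re

with_boundary = r"\b\d+(?:\.\d+)*\s*(?:oz|ml)\b"
without_boundary = r"\d+(?:\.\d+)*\s*(?:oz|ml)"

# Hypothetical input: "ozone" starts with a unit, "abc12ml" embeds one.
text = "8 ozone layers, code abc12ml, 30ml Bottle"

strict = re.findall(with_boundary, text)
loose = re.findall(without_boundary, text)

print(strict)  # ['30ml']
print(loose)   # ['8 oz', '12ml', '30ml']
```

The trailing \b rejects "8 oz" inside "8 ozone", and the leading \b rejects digits glued to a preceding word such as "abc12ml".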
Upvotes: 1
Reputation: 744
I think this regex will work for you:
[0-9]+\.*[0-9]* *(oz|ml)
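One caveat if this pattern is passed to findall as in the question: when a pattern contains a capturing group, re.findall (which pandas' str.findall applies to each element) returns the group's text rather than the whole match. A quick sketch of the difference, using one of the strings from the question:

```python
import re

text = 'Bowls, 4.4" dia x 2.5", 12ml. Natural'

# With a capturing group, findall returns only the group's text.
captured = re.findall(r"[0-9]+\.*[0-9]* *(oz|ml)", text)
print(captured)  # ['ml']

# A non-capturing group (?:...) returns the whole match instead.
whole = re.findall(r"[0-9]+\.*[0-9]* *(?:oz|ml)", text)
print(whole)     # ['12ml']
```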
Upvotes: 0
Reputation: 25489
Making as few changes to your regex as possible, so you can see what went wrong:
In [0-9]+.*[0-9]*, replace . with \.
An unescaped . means any character; \. means a literal period.
import pandas as pd
look_ahead = "oz|ml"
s = pd.Series(['Bla bla 0.75 oz. Bottle',
               'Mugs, 8oz. White',
               'Bowls, 4.4" dia x 2.5", 12ml. Natural',
               'Ala bala 3.3" 30ml Bottle'])
size_and_units = s.str.findall(
    rf"((?!,)[0-9]+\.*[0-9]* *(?={look_ahead})[a-zA-Z]+)")
print(size_and_units)
gives:
0 [0.75 oz]
1 [8oz]
2 [12ml]
3 [30ml]
You don't need to use a lookahead at all though, since you also want to match the units. Just use:
\d+\.*\d*\s*(?:oz|ml)
This gives the same result:
size_and_units = s.str.findall(
rf"\d+\.*\d*\s*(?:{look_ahead})")
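Note that these patterns return the text as written (8oz, 12ml), while the desired output in the question has a space between number and unit (8 oz, 12 ml). If that normalization matters, one possible approach is to capture the two parts separately and join them. The group names value and unit are my own choice for illustration, and I use ? instead of * so that at most one decimal part is allowed:

```python
import re

strings = ['Bla bla 0.75 oz. Bottle',
           'Mugs, 8oz. White',
           'Bowls, 4.4" dia x 2.5", 12ml. Natural',
           'Ala bala 3.3" 30ml Bottle']

# Named groups keep the number and the unit apart.
pattern = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>oz|ml)\b")

# Rebuild each match as "<value> <unit>" with exactly one space.
normalized = [
    [f"{m['value']} {m['unit']}" for m in pattern.finditer(s)]
    for s in strings
]
print(normalized)
# [['0.75 oz'], ['8 oz'], ['12 ml'], ['30 ml']]
```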
Upvotes: 1