Royale_w_cheese

Reputation: 297

Use regex to match 4- to 5-digit numbers (consecutive digits, i.e. no whitespace or special characters) that have no leading 0's

I am trying to use regular expressions to identify 4- to 5-digit numbers. The code below works in all cases except when consecutive 0's precede a one-, two-, or three-digit number. I don't want '0054', '0008', or '0009' to match, but I do want '10354', '10032', '9005', and '9000' to match. Is there a good way to implement this using regular expressions? Here is my current code, which fails only when leading 0's pad a shorter number out to 4 or 5 characters.

import re

line = 'US Machine Operations | 0054'
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))

Upvotes: 1

Views: 68

Answers (1)

Wiktor Stribiżew

Reputation: 627101

You may use

(?<!\d)[1-9]\d{3,4}(?!\d)

See the regex demo.

NOTE: In Pandas str.extract, you must wrap the part you want to be returned with a capturing group, a pair of unescaped parentheses. So, you need to use

(?<!\d)([1-9]\d{3,4})(?!\d)
       ^            ^

Example:

df2['num_col'] = df2.Warehouse.str.extract(r'(?<!\d)([1-9]\d{3,4})(?!\d)', expand=False).astype(float)

Since a capturing group is available here anyway, you may also use an equivalent regex without the lookbehind:

(?:^|\D)([1-9]\d{3,4})(?!\d)

Details

  • (?<!\d) - no digit immediately to the left
  • or (?:^|\D) - start of string or a non-digit char (the group is non-capturing so that the pattern keeps a single capturing group and str.extract returns only the number)
  • [1-9] - a non-zero digit
  • \d{3,4} - three or four digits
  • (?!\d) - no digit immediately to the right is allowed
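
To see how each part of the pattern contributes, here is a small sketch that runs the lookaround version against a few inputs. The sample strings ('123456', 'id90055x', '09005') are my own illustrations, not taken from the question.

```python
import re

# Pattern from the answer: a 4-5 digit run that starts with 1-9
# and has no digit on either side.
PATTERN = re.compile(r'(?<!\d)[1-9]\d{3,4}(?!\d)')

# Illustrative inputs (assumed for this sketch).
samples = ['0054', '9000', '10354', '123456', 'id90055x', '09005']

for s in samples:
    m = PATTERN.search(s)
    print(s, '->', m.group(0) if m else None)

# 0054     -> None   (the run starts with 0; '54' is preceded by a digit)
# 9000     -> 9000
# 10354    -> 10354
# 123456   -> None   (six digits: (?!\d) rejects any 4-5 digit prefix)
# id90055x -> 90055  (letters on both sides are fine)
# 09005    -> None   ((?<!\d) rejects '9005' because a 0 precedes it)
```

Note the last case: a number with a single leading zero is rejected as a whole rather than matched from its first non-zero digit, which appears to be what the question asks for.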

Python demo:

import re
s = "US Machine Operations | 0054 '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000'"
print(re.findall(r'(?<!\d)[1-9]\d{3,4}(?!\d)', s))
# => ['10354', '10032', '9005', '9000']
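
Applied to the code in the question, the change could look like the sketch below. The helper name first_number is hypothetical; it keeps the question's behaviour of printing 0 when nothing matches. The second pattern is the capturing-group variant, for which re.findall returns only the group contents.

```python
import re

# Lookaround version from the answer.
NUMBER = re.compile(r'(?<!\d)[1-9]\d{3,4}(?!\d)')
# Capturing-group version; findall returns only group 1.
NUMBER_GROUP = re.compile(r'(?:^|\D)([1-9]\d{3,4})(?!\d)')

def first_number(line):
    """Return the first 4-5 digit number with no leading zero, or 0 (hypothetical helper)."""
    match = NUMBER.search(line)
    return int(match.group(0)) if match else 0

print(first_number('US Machine Operations | 0054'))   # 0
print(first_number('US Machine Operations | 10354'))  # 10354
print(NUMBER_GROUP.findall("'9005', or '9000'"))      # ['9005', '9000']
```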

Upvotes: 3
