Digi

Reputation: 85

Regular expression help needed in Python

Can anyone help me write a regex that matches the pattern dd-ddd only when it is a whole word in a sentence? For example, in a sentence like this:

11-222 should be matched at the beginning of the sentence, as well as 33-444 in the middle but not 55-66-777 since the whole word does not match the pattern. If the pattern is present at the end, that should also be matched like 88-999

If I use a regex like r'\b\d{2}-\d{3}\b', it also matches 66-777, which is inside 55-66-777. I need to exclude that. It seems the hyphen (-) is treated as a word boundary.

Any idea how I can achieve this?

Added sample code and output

import re
regex_str = r'\b\d{2}-\d{3}\b'
msg_message = '11-222 should be matched, as well as 33-444 but not 55-66-777. If it is present at the end, that should also be matched like 88-999'
for match in re.finditer(regex_str, msg_message):
    print('*'*15)
    print(match.group(0))
    print(match.span())

Output:

***************
11-222
(0, 6)
***************
33-444
(37, 43)
***************
66-777
(55, 61)
***************
88-999
(125, 131)

Upvotes: 2

Views: 46

Answers (2)

pho

Reputation: 25489

You could use a negative lookbehind so that the pattern matches only when it is not preceded by a hyphen:

(?<!\-)\d{2}\-\d{3}

import re
regex_str = r'\b(?<!\-)\d{2}\-\d{3}\b'
msg_message = '11-222 should be matched, as well as 33-444 but not 55-66-777. If it is present at the end, that should also be matched like 88-999'
for match in re.finditer(regex_str, msg_message):
    print('*'*15)
    print(match.group(0))
    print(match.span())

***************
11-222
(0, 6)
***************
33-444
(37, 43)
***************
88-999
(125, 131)

You could do the same with a negative lookahead (?!\-) if you want to apply the same treatment to the right side of your expression.
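
A minimal sketch of that two-sided variant. The sentence is shortened from the question, and the token 11-222-33 is a hypothetical addition used only to show the right-hand check:

```python
import re

# Negative lookbehind and lookahead reject a hyphen on either side of the match.
regex_str = r'\b(?<!-)\d{2}-\d{3}\b(?!-)'

# '11-222-33' is a made-up token, added to exercise the lookahead.
msg_message = '11-222 should match, 33-444 too, but not 55-66-777 or 11-222-33. Also 88-999'

matches = [m.group(0) for m in re.finditer(regex_str, msg_message)]
print(matches)
```

Without the trailing (?!-), the leading 11-222 of 11-222-33 would still be reported, because \b is satisfied between the digit and the hyphen.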

Upvotes: 1

ctwheels

Reputation: 22817

You can use (?<!\S)\d{2}-\d{3}(?!\S). This pattern requires whitespace, or nothing at all (i.e. the start or end of the string), immediately before and after the match.


How it works:

  • (?<!\S) ensure what precedes doesn't match a non-whitespace character
  • \d{2} match two digits
  • - match this character literally
  • \d{3} match three digits
  • (?!\S) ensure what follows doesn't match a non-whitespace character
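
The breakdown above can be tried against the sample sentence from the question. This is a small sketch; the names pattern and found are just illustrative:

```python
import re

# Whitespace-delimited match: no non-whitespace character may touch either side.
pattern = re.compile(r'(?<!\S)\d{2}-\d{3}(?!\S)')

# Same sentence as in the question, split across two literals for readability.
msg_message = ('11-222 should be matched, as well as 33-444 but not 55-66-777. '
               'If it is present at the end, that should also be matched like 88-999')

found = [(m.group(0), m.span()) for m in pattern.finditer(msg_message)]
print(found)
```

Note that this is stricter than \b: a token directly followed by punctuation, such as "33-444," with a trailing comma, is not matched.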

The double negatives are deliberate: unlike (?<=\s) and (?=\s), they also succeed at the start and end of the string. The alternative is to use (?<=\s|^) and (?=\s|$) respectively (but it's longer and less sexy).
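
One caveat when using the positive alternative from Python: the built-in re module only accepts fixed-width lookbehinds, so (?<=\s|^) fails to compile there. The sketch below shows this and a roughly equivalent rewrite that moves ^ outside the lookbehind; the sample text is made up for illustration:

```python
import re

text = '11-222 and 55-66-777 then 88-999'  # hypothetical sample

# re rejects lookbehinds whose alternatives differ in width (\s is 1, ^ is 0).
try:
    re.compile(r'(?<=\s|^)\d{2}-\d{3}(?=\s|$)')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Positive form that re accepts, next to the double-negative form.
positive = re.compile(r'(?:^|(?<=\s))\d{2}-\d{3}(?=\s|$)')
negative = re.compile(r'(?<!\S)\d{2}-\d{3}(?!\S)')

print(compiled_ok)
print(positive.findall(text))
print(negative.findall(text))
```

Both compiled patterns return the same matches on this input, which is one more reason to prefer the shorter double-negative form.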

Upvotes: 2
