Verbal_Kint

Reputation: 1416

regex conditional matching

I am trying to use re.findall to find this pattern:

01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)

However, some entries have been shortened to 01-234-5 instead of 01-234-0005 when the last four digits are three zeros followed by a non-zero digit.

Since there doesn't seem to be any uniformity in the formatting, I had to account for a few different separator characters, or possibly none at all. Luckily, I have only noticed this shortening when some separator has been used...

Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?

So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)

Or is my only option to include all the possibly incorrect six-digit patterns and then check for a separator in Python?

Upvotes: 0

Views: 198

Answers (2)

jonrsharpe

Reputation: 122158

If you want the last group of digits to be "either one or four digits", try:

>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]

The last part of the pattern, \d(?:\d{3})?, means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma: the trailing \b already matches at the boundary between the final digit and any punctuation that follows.


Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:

r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
#    ^ exactly nine digits
#         ^ or
#                             ^ sep not optional
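
As a quick check of the alternation above, here is a minimal sketch that runs it against a few strings. The sample strings and the helper name full_matches are made up for illustration and are not taken from the asker's data. It uses finditer and group(0) because findall would return tuples of the capture groups.

```python
import re

# Alternation from the answer above: nine digits with no separator, or a
# separated form whose last group has one or four digits.
pattern = re.compile(r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b')

# Hypothetical sample strings, chosen to cover each case.
samples = [
    "01-234-5678",   # full form with separator
    "01-234-5",      # shortened form with separator
    "012345678",     # full form, no separator
    "012345",        # shortened form, no separator (should be rejected)
    "01-234-56",     # two-digit tail (should be rejected)
    "01-234:5678",   # mixed separators (should be rejected)
]

def full_matches(text):
    """Return the whole matched strings (group 0), ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]

results = {s: full_matches(s) for s in samples}
for sample, found in results.items():
    print(f"{sample!r}: {found}")
```

The shortened form is only accepted when a separator is present, which is the behaviour asked for in the question.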

See this demo.
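
For completeness, Python's re module also supports the conditional syntax (?(name)yes|no) that the question asks about. The following is only a sketch of how it might be applied here, not a pattern from the original answer: the separator group is made optional as a whole, so the conditional can test whether it took part in the match. The sample strings are hypothetical.

```python
import re

# Sketch, assuming the same 2-3-4 digit layout as above.
# (?P<sep>[-:\s])?  -> the group is optional, so it is either set or unset.
# (?(sep)A|B)       -> use A if the group was set, otherwise B.
conditional = re.compile(
    r'\b\d{2}(?P<sep>[-:\s])?\d{3}'
    r'(?(sep)(?P=sep)\d(?:\d{3})?'   # separator seen: same separator, then 1 or 4 digits
    r'|\d{4})\b'                     # no separator: exactly 4 more digits
)

# Hypothetical sample strings, not taken from the asker's data.
samples = ["01-234-5678", "01-234-5", "012345678", "012345", "01-234:5678"]

cond_results = {s: [m.group(0) for m in conditional.finditer(s)] for s in samples}
for sample, found in cond_results.items():
    print(f"{sample!r}: {found}")
```

On these inputs it accepts and rejects the same strings as the alternation, so the choice between the two is mostly a matter of readability.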

Upvotes: 2

fiacre

Reputation: 1180

It is not clear why you are using word boundaries, but I have not seen your data.

Otherwise you can shorten the entire thing to this:

re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')

Note that \d{1,4} matches a run of 1, 2, 3 or 4 digits.

A string with no separator, e.g. "012340008", will also match the regex above, because [-:\s]? matches 0 or 1 times.
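
To see what this shorter pattern accepts in practice, here is a small sketch run against a few made-up strings (the samples and the name loose are illustrative only). It shows that two-digit tails and the unseparated six-digit form both match, and that without word boundaries the pattern can match inside a longer run of digits.

```python
import re

# The shortened pattern from this answer, with no word boundaries.
loose = re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')

# Hypothetical sample strings, not taken from the asker's data.
samples = [
    "01-234-5678",     # full form
    "01-234-56",       # two-digit tail is accepted by \d{1,4}
    "012340008",       # no separator, full form
    "012345",          # no separator, shortened form is accepted too
    "9901-234-56789",  # match is found inside a longer string
]

loose_results = {s: [m.group(0) for m in loose.finditer(s)] for s in samples}
for sample, found in loose_results.items():
    print(f"{sample!r}: {found}")
```

Whether that looseness is acceptable depends on the data; if the unseparated short form must be excluded, the separator has to be required for that branch.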

HTH

Upvotes: 0
