Reputation: 69
This regex will get 456. My question is why it CANNOT be 234 from 1-234-56? Does 56 qualify for the (?!\d) pattern, since it is NOT a single digit? Where is the starting point from which (?!\d) looks ahead?
import re
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
a = pattern.findall("The number is: 123456")
print(a)  # ['456']
This is the first step toward adding a comma separator, as in 123,456.
results = pattern.finditer('123456')
for result in results:
    print(result.start(), result.end(), result)
Upvotes: 1
Views: 1023
Reputation: 627292
My question is why it CANNOT be 234 from 1-234-56?
It is not possible, as (?=(\d{3})+(?!\d)) requires that only complete 3-digit sequences appear after the 1-3-digit sequence. 56 (the last digit group in your imagined scenario) is a 2-digit group. Also, the \d{1,3} quantifier is greedy and the engine returns the first match it finds scanning from left to right, so it settles on 123 and does not go on to try other one-, two-, or three-digit groupings. To get 234 from 123456, you'd need a specifically tailored regex for it: \B\d{3}, or (?<=1)\d{3}, or even \d{3}(?=\d{2}(?!\d)).
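As a quick check of the three tailored patterns mentioned above, here is a minimal sketch. The variable names (`text`, `candidates`, `matches`) are only illustrative, and the input is assumed to be the bare string 123456.

```python
import re

text = "123456"

# The three tailored patterns suggested above.
candidates = [
    r"\B\d{3}",               # three digits that do not start at a word boundary
    r"(?<=1)\d{3}",           # three digits preceded by a literal 1
    r"\d{3}(?=\d{2}(?!\d))",  # three digits followed by exactly two trailing digits
]

# Map each pattern to the first substring it matches.
matches = {rx: re.search(rx, text).group(0) for rx in candidates}

for rx, found in matches.items():
    print(rx, "->", found)  # each pattern prints 234
```

Each of these works only because it encodes knowledge about this specific input, so none of them is a general digit-grouping pattern.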
Does 56 match the (?!\d) pattern? Where is the beginning point that (?!\d) will look for?
No. This is a negative lookahead: it does not match any text, it only checks that there is no digit right after the current position in the input string. If there is a digit, the match attempt fails at that position (no result is found or returned from it).
More clarification on the lookahead: it is located after the (\d{3})+ subpattern, so the regex engine checks for a digit right after the last 3-digit group and fails the match if a digit is found (as it is a negative lookahead). In plain words, (?!\d) is a number closing/trailing boundary in this regex.
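To see the trailing-boundary behaviour in isolation, the following sketch uses a simplified pattern, \d{3}(?!\d), which is an assumption made for illustration and not the regex from the question. The names `boundary`, `m` and `m2` are likewise hypothetical.

```python
import re

# \d{3} must be followed by "not a digit": a non-digit character or the end of the string.
boundary = re.compile(r"\d{3}(?!\d)")

# 123, 234 and 345 are each followed by a digit, so they are rejected;
# only 456 sits right before the end of the string.
m = boundary.search("123456")
print(m.group(0), m.span())  # 456 (3, 6)

# The lookahead consumes nothing: the match ends right after the last digit,
# and the following space is not part of it.
m2 = boundary.search("123456 units")
print(m2.group(0), m2.end())  # 456 6
```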
A more detailed breakdown:
\d{1,3} - a 1 to 3 digit sequence, as many as possible (a greedy quantifier is used)
(?=(\d{3})+(?!\d)) - a positive lookahead ((?=...)) that checks if the 1-3 digit sequence matched before is followed by
  (\d{3})+ - 1 or more (+) sequences of exactly 3 digits...
  (?!\d) - not followed by a digit.
Lookaheads do not consume characters (they are zero-width), but you can still capture inside them. When a lookahead has been executed, the regex index is at the same character as before. With your regex and input, you match 123 with \d{1,3} because a 3-digit sequence (456) follows it. But 456 is captured within the lookahead, and re.findall returns only the captured texts if capturing groups are set.
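The difference between the matched text and the captured text can be checked with a short sketch like the one below. It reuses the pattern and input from the question. The non-capturing variant and the variable names are additions for illustration.

```python
import re

text = "The number is: 123456"
pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")

# findall returns the contents of group 1, which was captured inside the lookahead.
captured = pattern.findall(text)
print(captured)  # ['456']

# finditer gives match objects, so the actual match (group 0) is visible.
whole = [m.group(0) for m in pattern.finditer(text)]
spans = [m.span() for m in pattern.finditer(text)]
print(whole, spans)  # ['123'] [(15, 18)]

# With a non-capturing group, findall returns the whole match instead.
non_capturing = re.findall(r"\d{1,3}(?=(?:\d{3})+(?!\d))", text)
print(non_capturing)  # ['123']
```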
To just add a comma as the digit grouping symbol, use
rx = r'\d(?=(?:\d{3})+(?!\d))'
See IDEONE demo
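One way to apply that pattern is with re.sub, as in this minimal sketch. The helper name `add_commas` is hypothetical, and the sketch assumes plain integers without a decimal part.

```python
import re

# Pattern from the answer: a digit followed by one or more complete
# 3-digit groups that run up to the end of the number.
rx = r'\d(?=(?:\d{3})+(?!\d))'

def add_commas(text):
    """Insert a comma after every digit that has whole 3-digit groups to its right."""
    # \g<0> re-inserts the matched digit; the comma is appended after it.
    return re.sub(rx, r'\g<0>,', text)

print(add_commas("The number is: 123456"))  # The number is: 123,456
print(add_commas("1234567"))                # 1,234,567
print(add_commas("12"))                     # 12
```

Since the lookahead consumes nothing, each digit is examined independently, which is why a single-digit \d is enough here.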
Upvotes: 1