Reputation: 121
I'm writing a set of regexes to match dates in text using Python. One of them is designed to match dates in the format MM/YYYY only. The regex is the following:
r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
It looks like the word boundary is not working, because the regex matches parts of dates like 12/02/2020 (it should not match this date format at all).
In the attached image only the second pattern should have been recognized. The first one should not have matched at all, not even partially.
Keep in mind that the regex should look for the MM/YYYY pattern inside strings like:
"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
Can you help me find the error in my pattern so that it matches only my target format?
Upvotes: 1
Views: 1832
Reputation: 1243
A word boundary, in most regex dialects, is a position between \w (word char) and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
What is a word boundary in regex?
What happens is that the / character is not part of the group \w, so every / in your string that sits next to a digit creates a new word boundary.
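To make this concrete, here is a small sketch that lists every position where \b succeeds in the date from the question, and shows where the original pattern ends up matching. The variable names are only illustrative.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
)

text = "12/02/2020"

# Every index at which \b matches: the string edges and both sides of each slash.
boundaries = [m.start() for m in re.finditer(r'\b', text)]
print(boundaries)

# The search fails at index 0 but succeeds from the boundary right after
# the first slash, which is why only part of the date is reported.
match = pattern.search(text)
print(match.group(0), match.span())
```

The boundary right after the first slash is what lets the engine start a fresh match in the middle of the longer date.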
You have not provided the full string you are matching, but for the example you posted you could solve it by just adding the anchors ^ and $:
^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$
https://regex101.com/r/xncZNN/1
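A quick check of the anchored pattern in Python, as a sketch: it rejects the longer date, but it also only accepts input where the date is the entire string, so it will not find a date embedded in a sentence. The sample strings are made up for illustration.

```python
import re

# The anchored pattern from above, unchanged.
anchored = re.compile(
    r'^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$'
)

# Illustrative inputs: a bare MM/YYYY, a full date, and a date inside a sentence.
samples = ["02/2020", "12/02/2020", "Due in 02/2020 at the latest"]

results = {s: bool(anchored.search(s)) for s in samples}
for sample, matched in results.items():
    print(f"{sample!r}: {matched}")
```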
edit:
Working from your full example, I cleaned up your regex a bit because it was hard to follow, but I think I understood the pattern you were trying to match.
Here is the new version:
(?<=^|[a-zA-Z ])(0[0-9]|1[0-2]|[1-9])(?:\/|\\)([\d]{4})(?=[a-zA-Z ]|$)
I have replaced the word boundaries with a lookbehind (?<=...) and a lookahead (?=...), which specify what I want to see before and after the date. You can adjust them to your specific needs and add other characters, like numbers or other specific characters.
https://regex101.com/r/xncZNN/4
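One caveat if you run this with Python's built-in re module: it requires lookbehinds to have a fixed width, and a lookbehind that alternates between ^ (zero characters) and a character class (one character) is rejected at compile time. Below is a sketch of an equivalent form that moves ^ out of the lookbehind. The month part is written as 1[0-2] here so that October is covered, and the sample sentence is an assumption based on the one in the question.

```python
import re

# A lookbehind mixing a zero-width branch with a one-character branch
# does not compile in Python's re module.
try:
    re.compile(r'(?<=^|[a-zA-Z ])\d')
    mixed_width_compiles = True
except re.error:
    mixed_width_compiles = False

# Equivalent form: "start of string" OR "preceded by a letter or space".
pattern = re.compile(
    r'(?:^|(?<=[a-zA-Z ]))'    # what may come before the month
    r'(0[0-9]|1[0-2]|[1-9])'   # month
    r'[/\\]'                   # forward slash or backslash
    r'(\d{4})'                 # year
    r'(?=[a-zA-Z ]|$)'         # what may come after the year
)

# Hypothetical sentence: two full dates plus one MM/YYYY date.
text = "The range of dates go from 21/02/2020 to 21/03/2020, then 03/2020 onwards."

found = [m.group(0) for m in pattern.finditer(text)]
print(found)
```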
Upvotes: 2
Reputation: 110685
The problem is that \b\d{2}/\d{4}\b matches 02/2000 in the string 01/02/2000 because there is a word boundary between the first forward slash and the digit that follows it. The solution is to identify the characters that should not precede or follow the match and use negative lookarounds in place of word boundaries. Here you could use the regular expression
r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'
The negative lookbehind, (?<![\d/]), prevents the two digits representing the month from being preceded by a digit or forward slash; the negative lookahead, (?![\d/]), prevents the four digits representing the year from being followed by a digit or forward slash.
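A short sketch of this expression in use, assuming a few made-up sample strings alongside the sentence from the question:

```python
import re

# The expression above, unchanged.
pattern = re.compile(r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])')

# Full dates: nothing should match, not even a fragment.
full_dates = "The range of dates go from 21/02/2020 to 21/03/2020 as specified above."

# Illustrative sentence with two MM/YYYY dates.
month_years = "Due 06/2000 and 12/2021."

print(pattern.findall(full_dates))
print(pattern.findall(month_years))
print(pattern.findall("01/02/2000"))
```

Both lookarounds accept the start or end of the string, so no separate anchors are needed for dates at the edges of the text.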
If 6/2000 is to be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].
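As a sketch, the variant with the optional leading zero behaves like this on a few illustrative inputs (the sample list is an assumption, not from the question):

```python
import re

# Same expression with the leading zero of the month made optional.
pattern = re.compile(r'(?<![\d/])(?:0?[1-9]|1[0-2])/\d{4}(?![\d/])')

# Hypothetical inputs: valid months with and without a leading zero,
# an invalid month, and a full date that must be ignored.
samples = ["6/2000", "06/2000", "12/2000", "16/2000", "1/02/2000"]

matches = {s: pattern.findall(s) for s in samples}
for sample, found in matches.items():
    print(f"{sample!r}: {found}")
```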
Upvotes: 2