user1315621

Reputation: 3412

re not matching when using an OR

I am trying to match dates (plain 8-digit numbers in this case) in the following string:

mystring = '_20180701_20190630'

I am using the following code:

re.findall(r'(?:\A|_){1}([0-9]{4}[_]{0,1}[0-9]{2}[_]{0,1}[0-9]{2})(?:$|_){1}', mystring)

The pattern is overcomplicated for this particular example because I also need to handle other, more complex inputs.

Given that, I do not understand why the pattern above does not match the last number, while the following one does. The only difference is at the end of the pattern: (?:$){1} instead of (?:$|_){1}:

re.findall(r'(?:\A|_){1}([0-9]{4}[_]{0,1}[0-9]{2}[_]{0,1}[0-9]{2})(?:$){1}', mystring) 

Why does the OR operator prevent the match? Is it because it is greedy and there is another number before it?

Upvotes: 2

Views: 61

Answers (2)

user12097764

Reputation:

In the target sample _20180701_20190630, no number starts at the beginning of the string (\A). Why is \A offered in the alternation (?:\A|_)?

Can the number appear without a preceding _ when it is at the beginning of the string?

Basically, if this is not a multi-line operation, use lookarounds for boundary consistency and drop the anchors entirely. The regex should be this:

(?<![^_])(\d\d\d\d(?:_?\d\d){2})(?![^_])

https://regex101.com/r/HkGZEo/1
https://regex101.com/r/PkwEdK/1
https://regex101.com/r/VAREFJ/1

Expanded

 (?<! [^_] )                   # Look Behind, a _ or BOS
 (                             # (1 start)
      \d\d\d\d 
      (?: _? \d\d ){2}
 )                             # (1 end)
 (?! [^_] )                    # Look Ahead, a _ or EOS
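
As a quick check of the expanded form in Python, here is a minimal sketch that compiles it with re.VERBOSE, which ignores whitespace and # comments inside the pattern. The variable names (boundary_pattern, dates) are only illustrative, and the layout is slightly compacted compared with the listing above.

```python
import re

mystring = '_20180701_20190630'

# Lookaround version: the boundaries are asserted, never consumed,
# so a single _ can serve as the boundary for two adjacent numbers.
boundary_pattern = re.compile(r"""
    (?<![^_])            # look behind: a _ or beginning of string
    (                    # (1 start)
        \d\d\d\d
        (?:_?\d\d){2}
    )                    # (1 end)
    (?![^_])             # look ahead: a _ or end of string
""", re.VERBOSE)

dates = boundary_pattern.findall(mystring)
print(dates)  # ['20180701', '20190630']
```

Because (?<![^_]) reads as "not preceded by a character other than _", it also succeeds at position 0, where nothing precedes the number at all.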

Upvotes: 0

anubhava

Reputation: 785631

Your regex matches and consumes the trailing _ after the first number, so the next match, which must also start with _, has nothing left to start from and fails.
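
To see the consumption directly, here is a small sketch that prints the span of each match for the pattern from the question. The names original and spans are illustrative only.

```python
import re

mystring = '_20180701_20190630'
original = r'(?:\A|_){1}([0-9]{4}[_]{0,1}[0-9]{2}[_]{0,1}[0-9]{2})(?:$|_){1}'

# span() reports which characters the whole match consumed,
# including the non-capturing groups on both sides.
spans = [(m.span(), m.group(1)) for m in re.finditer(original, mystring)]
print(spans)  # [((0, 10), '20180701')]
```

The first match covers indices 0 through 9, which includes the _ at index 9. The search resumes at index 10, where the remaining text is 20190630 with no leading _ and no start-of-string, so the second number is never found.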

You may use lookahead to solve this:

(?:\A|_)(\d{4}_?\d{2}_?\d{2})(?=_|\Z)


By using a positive lookahead, i.e. (?=_|\Z), we only assert the presence of _ or \Z without actually consuming it.

I have also simplified your regex: {1} can be removed, {0,1} can be replaced with just ? (optional match), [_] can be just _, and [0-9] can be shortened to \d.
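
A short sketch of the lookahead version run against the string from the question, plus a second input with separators inside the date. The second input is an assumed example of the "more complex situations" mentioned in the question, not something taken from it.

```python
import re

# Same pattern as above: the trailing boundary is only asserted.
fixed = r'(?:\A|_)(\d{4}_?\d{2}_?\d{2})(?=_|\Z)'

plain = re.findall(fixed, '_20180701_20190630')
print(plain)  # ['20180701', '20190630']

# Hypothetical input with optional underscores inside the first date.
mixed = re.findall(fixed, '_2018_07_01_20190630')
print(mixed)  # ['2018_07_01', '20190630']
```

Since the lookahead leaves the _ in place, the same underscore ends the first match and starts the second.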

Upvotes: 2
