Reputation: 57
For several different regular expressions, I have found that optional and conditional sections of the regex behave differently for the first match than for subsequent matches. I am using Python, but I found this to hold in general.
Here are two similar examples that illustrate the issue:
(?:\w. )?([^,.]*).*(\d{4}\w?)
J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
Match 1
Match 2
((?:\w\. )?[^,.]*).*(\d{4}\w?)
J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
Match 1
Match 2
I would expect this to behave differently: I would think the matches would be consistent. Here is what I think the result should be (and I don't yet understand why it isn't):
Match 1
Match 2
Match 1
Match 2
Upvotes: 3
Views: 91
Reputation: 2892
In your first example you expect the second line to match 'wang Wang'. <<example 1>> shows clearly that's not what's happening.
After the first match, which ends at '2002', the regex tries to match the remaining text, which starts with '.\n\nR. wang Wang'. In your first regex the optional non-capturing group cannot match at that point, so it is skipped and group 1 takes over, ending up with '\n\nR'.
(?: # non-capturing group
\w. # word char, followed by 1 char, followed by space
)? # read 0 or 1 times
( # start group 1
[^,.]* # read anything that's not a comma or dot, 0 or more times
) # end group 1
.* # read anything
( # start group 2
\d{4} # until there's 4 digits
\w? # optionally followed by a word char
) # end group 2
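To see this for yourself, here is a minimal Python sketch of the first regex. The sample text is an assumption: two shortened copies of your citations separated by a blank line, which is what the '\n\nR' result suggests. The name `first_regex_matches` is just for illustration.

```python
import re

# Assumed sample input: two shortened citations separated by a blank line.
text = (
    "J. wang Wang, X. Liu, and A. A. Chien. In Proceedings of the USENIX Security Symposium, 2002.\n"
    "\n"
    "R. wang Wang, X. Liu, and A. A. Chien. In Proceedings of the USENIX Security Symposium, 2002."
)

# First regex from the question, unchanged.
pattern = re.compile(r"(?:\w. )?([^,.]*).*(\d{4}\w?)")

first_regex_matches = []
for m in pattern.finditer(text):
    # repr() makes the newlines captured in group 1 visible.
    print(m.start(), repr(m.group(1)), repr(m.group(2)))
    first_regex_matches.append((m.group(1), m.group(2)))
```

The second match starts at the newline after the first citation rather than at the 'R', so the optional group has nothing to match and group 1 picks up the newlines plus 'R'.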
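The same applies to your second regex: even here your non-capturing group (?:\w\. )? doesn't consume the 'R. ', because there are a dot and some newlines in front of the initials. The match attempt starts at those characters, where the optional group cannot match, so it is skipped and [^,.]* consumes the newlines and the 'R'.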
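The same check for the second regex, again as a sketch on assumed sample text (two shortened citations separated by a blank line). `second_regex_groups` is an illustrative name.

```python
import re

# Assumed sample input: two shortened citations separated by a blank line.
text = (
    "J. wang Wang, X. Liu, and A. A. Chien. In Proceedings of the USENIX Security Symposium, 2002.\n"
    "\n"
    "R. wang Wang, X. Liu, and A. A. Chien. In Proceedings of the USENIX Security Symposium, 2002."
)

# Second regex from the question: the optional initial sits inside group 1.
pattern = re.compile(r"((?:\w\. )?[^,.]*).*(\d{4}\w?)")

# findall returns one (group 1, group 2) tuple per match.
second_regex_groups = pattern.findall(text)
for name, year in second_regex_groups:
    print(repr(name), repr(year))
```

On the first line the optional group matches 'J. ' because the attempt starts exactly at the 'J'. For the second citation the attempt starts at the newline, so group 1 again begins with the newlines instead of the initial.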
You could have solved it like this: ([A-Z]\.)\s([^.,]+).*(\d{4}) (see example 3).
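A quick way to try the suggested pattern in Python, as a sketch on assumed sample text (two shortened citations separated by a blank line). Because the pattern must start with a capital letter followed by a dot, the leftover dot and newlines can no longer begin a match. `fixed_groups` is an illustrative name.

```python
import re

# Assumed sample input: two shortened citations separated by a blank line.
text = (
    "J. wang Wang, X. Liu, and A. A. Chien. In Proceedings of the USENIX Security Symposium, 2002.\n"
    "\n"
    "R. wang Wang, X. Liu, and A. A. Chien. In Proceedings of the USENIX Security Symposium, 2002."
)

# Suggested pattern: a mandatory initial, whitespace, the name, then the year.
pattern = re.compile(r"([A-Z]\.)\s([^.,]+).*(\d{4})")

# findall returns one (initial, name, year) tuple per match.
fixed_groups = pattern.findall(text)
for initial, name, year in fixed_groups:
    print(repr(initial), repr(name), repr(year))
```

Note that this relies on every entry starting with a capital initial and a dot; entries in another format would need a different pattern.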
Upvotes: 1