cjlovering

Reputation: 57

Python Regex Inconsistency

With several different regular expressions, I have found that optional and conditional sections behave differently for the first match than for subsequent matches. I am using Python, but the behavior seems to hold in other regex engines too.

Here are two similar examples that illustrate the issue:

First Example:

expression:

(?:\w. )?([^,.]*).*(\d{4}\w?)

text:

J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

matches:

Match 1

  1. wang Wang
  2. 2002

Match 2

  1. R
  2. 2002

Second example:

expression:

((?:\w\. )?[^,.]*).*(\d{4}\w?)

text:

J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

matches:

Match 1

  1. J. wang Wang
  2. 2002

Match 2

  1. R
  2. 2002

What am I missing?

I expected the matches to be consistent. This is what I think the output should be (and I don't yet understand why it isn't):

Example 1

Match 1

  1. wang Wang
  2. 2002

Match 2

  1. wang Wang
  2. 2002

Example 2

Match 1

  1. J. wang Wang
  2. 2002

Match 2

  1. R. wang Wang
  2. 2002

Upvotes: 3

Views: 91

Answers (1)

Marc Lambrichs

Reputation: 2892

In your first example, you expect the second line to match 'wang Wang'. Example 1 shows clearly that this is not what happens.

After the first match, which ends with '2002' (the final '.' is not consumed), the regex tries to match the remaining text, which starts with '.\n\nR. wang Wang'. No match can start at that dot, so the next attempt starts at the first newline. There the optional non-capturing group cannot match (a newline is not a word character) and is skipped, so group 1 takes over and matches everything up to the next dot, ending up with '\n\nR'.

(?:                   # non-capturing group
  \w.                 # word char, then any 1 char (the dot is unescaped), then a space
)?                    # match 0 or 1 times
(                     # start group 1
[^,.]*                # anything that's not a comma or dot, 0 or more times (newlines included)
)                     # end group 1
.*                    # anything up to the end of the line (greedy)
(                     # start group 2
\d{4}                 # exactly 4 digits
\w?                   # optionally followed by a word char
)                     # end group 2
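The annotated form can be compiled directly with Python's re.VERBOSE flag. The sketch below assumes a shortened, made-up sample string and writes the space inside the optional group as "[ ]", because verbose mode ignores bare whitespace in the pattern.

```python
import re

# The annotated pattern, compiled as written. In VERBOSE mode bare whitespace
# is ignored, so the space inside the optional group is spelled "[ ]".
annotated = re.compile(r"""
    (?:            # non-capturing group
      \w.[ ]       # word char, any one char, then a space
    )?             # zero or one time
    ([^,.]*)       # group 1: anything that is not a comma or dot
    .*             # anything up to the end of the line
    (\d{4}\w?)     # group 2: four digits, optionally one word char
""", re.VERBOSE)

# The compact form from the question, for comparison.
compact = re.compile(r"(?:\w. )?([^,.]*).*(\d{4}\w?)")

# Shortened, hypothetical sample (not the full citation from the question).
sample = "J. wang Wang, X. Liu. In Proceedings, 2002."

verbose_groups = annotated.search(sample).groups()
compact_groups = compact.search(sample).groups()
print(verbose_groups, compact_groups)
```

Both forms should capture the same groups on this sample.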

The same applies to your second regex: there, too, the non-capturing group (?:\w\. )? doesn't consume the 'R.' because the match starts at the newlines that follow the leftover dot, in front of the initial, where \w cannot match.
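Printing the captured groups with repr() makes the hidden newlines visible. This is a minimal sketch, assuming the two citations are separated by a blank line and that the \n inside each citation is a literal backslash followed by n, as it appears in the question. The variable names are illustrative only.

```python
import re

# One citation with a placeholder for the initial. "\\n" is a literal
# backslash + n (an assumption about the question's text), not a newline.
citation = (
    "{} wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating "
    "\\nDenial-of-Service Attacks with a Proxy Network. In Proceedings of "
    "the USENIX Security Symposium, 2002."
)
# Two citations separated by a blank line (assumed).
text = citation.format("J.") + "\n\n" + citation.format("R.")

first = re.compile(r"(?:\w. )?([^,.]*).*(\d{4}\w?)")
second = re.compile(r"((?:\w\. )?[^,.]*).*(\d{4}\w?)")

first_groups = [m.groups() for m in first.finditer(text)]
second_groups = [m.groups() for m in second.finditer(text)]

# repr() shows the newlines that a regex tester renders as blank space.
for groups in first_groups + second_groups:
    print(repr(groups))
```

With both patterns, group 1 of the second match is '\n\nR': the two newlines are captured along with the initial, which is why the match list appears to show just "R".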

You could solve it by making the initial mandatory, so that every match has to start at a capital letter followed by a dot: ([A-Z]\.)\s([^.,]+).*(\d{4}). See example 3.
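The suggested pattern can be checked the same way. Again this is only a sketch, under the assumptions that the citations are separated by a blank line and that the \n in the text is a literal backslash followed by n.

```python
import re

# Same assumed input as in the question: two citations, blank line between.
citation = (
    "{} wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating "
    "\\nDenial-of-Service Attacks with a Proxy Network. In Proceedings of "
    "the USENIX Security Symposium, 2002."
)
text = citation.format("J.") + "\n\n" + citation.format("R.")

# The initial is no longer optional, so a match cannot begin at the
# leftover dot or at the newlines between the citations.
fixed = re.compile(r"([A-Z]\.)\s([^.,]+).*(\d{4})")

fixed_groups = [m.groups() for m in fixed.finditer(text)]
for groups in fixed_groups:
    print(repr(groups))
```

Both matches now yield the initial, the name and the year as separate groups, so the two citations are treated consistently.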

Upvotes: 1
