jpeltoniemi

Reputation: 5620

Regular expressions matching priority weirdness

While writing an inflection script in PHP, I came across a strange (at least to me) behavior. My regular expression, which contains many different patterns for word endings, seems to match the second acceptable alternative even when the first one is perfectly OK.

The full expression goes like this (you really don't need to pay attention to this):

([kpt])\1([aou])$|(?:(n)t)?([auo])$|([aeou][^aeiouyäö]+[aeiou])$|([auo][^aeiouyäö]+)$|([^aeiouyäö])(?!\6)([^aeiouyäö])(e)$|((?:[auo]i|[auo])[^aeiouyäö]*)([aeiouyäö])\10$

Needless to say, I first suspected that I had made a mistake somewhere, so I dumbed the expression down little by little and finally got to this stage:

([aeiou])$|(.+)$

Which should literally mean: "Match one vowel at the end of the string and use it as backreference 1, or if that fails, match just any character any number of times before end of the string and use it as backreference 2."

When used with a string like foo, I'd expect the first part of the expression (([aeiou])$) to be used. Instead, the second one is used, and that confuses me.

If the quantifier is removed, the first option is used. I'm guessing that this has something to do with the greediness or specificity of the expression parts, though I thought that the expression is tested left to right.
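
The behavior can be reproduced with a few lines of PHP. This is only a minimal sketch: it assumes preg_match and the test word foo, and the variable names are made up for illustration.

```php
<?php
// With the quantifier: the second alternative captures the whole word.
preg_match('/([aeiou])$|(.+)$/', 'foo', $withPlus);
// $withPlus is ['foo', '', 'foo']: group 1 is empty, group 2 holds "foo".
print_r($withPlus);

// Without the quantifier: the first alternative captures the final vowel.
preg_match('/([aeiou])$|(.)$/', 'foo', $withoutPlus);
// $withoutPlus is ['o', 'o']: group 2 did not take part, so PHP omits it.
print_r($withoutPlus);
```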

Could someone explain this behavior to me?

Upvotes: 1

Views: 335

Answers (1)

NPE

Reputation: 500703

Match one vowel at the end of the string [...] or if that fails, match just any character any number of times before end of the string [...]

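No, that's not what it means. The engine returns the leftmost match: it tries each starting position in turn, beginning at the start of the string, and at each position it tries the alternatives from left to right:

  • one vowel at the end of the string
  • any character any number of times before end of the string

With foo, the first alternative fails at position 0 because f is not a vowel, but the second one matches the whole string from there, so the engine never reaches the final o. The order of the alternatives only decides between alternatives that can match at the same starting position.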
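
To see which starting position the engine settles on, preg_match can report offsets through the PREG_OFFSET_CAPTURE flag. The sketch below assumes PHP's PCRE engine and uses made-up test strings and variable names. It checks that the match for foo begins at offset 0, and which alternative is taken when two of them can match at the same position.

```php
<?php
// PREG_OFFSET_CAPTURE turns each entry into [matched text, byte offset].
preg_match('/([aeiou])$|(.+)$/', 'foo', $m, PREG_OFFSET_CAPTURE);
$text  = $m[0][0];   // "foo"
$start = $m[0][1];   // 0: the match begins at the first character

// Both alternatives can match at offset 0 here; the one listed first is used.
preg_match('/a|ab/', 'ab', $first);    // $first[0] is "a"
preg_match('/ab|a/', 'ab', $second);   // $second[0] is "ab"

printf("%s at %d, %s, %s\n", $text, $start, $first[0], $second[0]);
```

So in this engine the choice between alternatives at one position follows their order in the pattern, not their length.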

Upvotes: 1
