Jeff Meatball Yang
Jeff Meatball Yang

Reputation: 39017

Regular Expressions and The Pipe Operator

A little regex help please.

Why are these different?

Regex.Replace("(999) 555-0000 /x ext123", "/x.*|[^0-9]", String.Empty)
"9995550000"


Regex.Replace("(999) 555-0000 /x ext123", "[^0-9]|/x.*", String.Empty)
"9995550000123"

I thought the pipe operator did not care about order... or maybe there is something else that can explain this?

Upvotes: 4

Views: 4008

Answers (5)

Alan Moore
Alan Moore

Reputation: 75222

I think you've got the wrong idea about alternation (i.e., the pipe). In a pure DFA regex implementation, it's true that alternation favors the longest match no matter how the alternatives are ordered. In other words, the whole regex, whether it contains alternation or not, always returns the earliest and longest possible match--the "leftmost-longest" rule.

However, the regex implementations in most of today's popular programming languages, including .NET, are what Friedl calls Traditional NFA engines. One of the most important differences between them and DFA engines is that alternation is not greedy; it attempts the alternatives in the order they're listed and stops as soon as one of them matches. The only thing that will cause it to change its mind is if the match fails at a later point in the regex, forcing it to backtrack into the alternation.

Note that if you change the [^0-9] to [^0-9]+ in both regexes you'll get the same result from both--but not the one you want. (I'm assuming the /x.* alternative is supposed to match--and remove--the rest of the string, including the extension number.) I'd suggest something like this:

"[^0-9/]+|/x.*$"

That way, neither alternative can even start to match what the other one matches. Not only will that will prevent the kind of confusion you're experiencing, it avoids potential performance bottlenecks. One of the other major differences between DFA's and NFA's is that badly-written NFA's are prone to serious (even catastrophic) performance problems, and sloppy alternations are one of the easiest ways to trigger them.

Upvotes: 3

Matthew Scharley
Matthew Scharley

Reputation: 132254

If I took a wild guess, I'd say that it's running the first part of the expression first, and then the second part. So, what's happening in the second case is that it's removing all the non-numeric parts, which means that the second part will never match, and leaves you with the extension intact.

Since it has to run some part of the expression first, since it can't run both at the same time, I'd say this is a fairly natural assumption, though I can see why you might get caught... Definitely an interesting gotcha though.

EDIT: To address the wording, as Ben rightly pointed out, the expression is attempted to be matched starting with each character in the string. So, what happens in the second case is:

  • There is no "^" anchor, so we try at the start of each substring:
  • For "(999) 555-0000 /x ext123", "(" matches [^0-9], so replace that with nothing (remove it).
  • For "999) 555-0000 /x ext123", the "999" part doesn't match [^0-9], nor does it match /x.*, so we keep trying from the ")", which matches [^0-9], so we remove it.
  • And so on. When it gets to the "/", the same thing happens, it matches [^0-9] and is removed, meaning the second part of the regex can never, ever match.

In the first case, what happens is the following:

  • Again, no "^" anchor, so we try for all substrings:
  • For "(999) 555-0000 /x ext123", "(" does not match /x.*, but it does match [^0-9], so replace that with nothing (remove it).
  • For "999) 555-0000 /x ext123", the "999" part doesn't match /x.*, nor does it match [^0-9], so we keep trying from the ")", which doesn't match /x.*, but which matches [^0-9], so we remove it.
  • When we hit the "/x", this time /x.* does match, it matches "/x ext123", and the rest of the string is removed, leaving us with nothing to continue with.

Upvotes: 7

arsenbonbon
arsenbonbon

Reputation: 725

The problem in your expression is that /x.* matches greedily. In the first expression it is provided first, so the engine attempts to match it first, resulting in matching all of the rest of the string after /x too because of the .* .

If you change it to /x.*? you get the same result as the second expression. The ? after the * tells the regex engine to match non-greedy.

Check http://www.regular-expressions.info/repeat.html to learn more about greediness.

Upvotes: -2

Billy ONeal
Billy ONeal

Reputation: 106530

The alternation operator ( | ) attempts expressions in the order provided. In the first example, it tries to match the expression /x.* first, then it tries to match [^0-9].

Because the string " /x ext" can be matched by the first expression [^0-9], as in the second example, the second part of the expression, /x.* is never invoked.

Billy3

EDIT: More information here on the alternation operator: http://www.regular-expressions.info/alternation.html

Upvotes: 1

leppie
leppie

Reputation: 117220

You are missing some parenthesis. :)

Upvotes: -2

Related Questions