Pratik K. Shah

Reputation: 457

C# Regular Expression - break text into specific pieces

I am applying the regular expression @"^(.*)([0-9/+-^]+)=([0-9]+)$" to the string "3u->4+5=8". Group[1] returns "3u->4+" and Group[2] returns "5".

I expected the following instead:

Group[0]="3u->4+5=8"   
Group[1]="3u->"   
Group[2]="4+5"   
Group[3]="8"

How can I get this result?

Upvotes: 1

Views: 111

Answers (1)

Robin

Reputation: 9644

Your issue is caused by the greedy quantifier .*, which will try to "eat up" everything it can.

Use a lazy quantifier instead:

^(.*?)([0-9/+^-]+)=([0-9]+)

This will cause .*? to match as little as possible while still finding an overall match: the quantifier will stop just before the 4 in your example.
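As a quick check, here is a minimal sketch that runs the lazy pattern above against the string from the question. It is written as top-level statements (which assumes C# 9 or later), and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Input string taken from the question.
string input = "3u->4+5=8";

// Lazy first group; the hyphen is last in the class, so it is a literal.
Match m = Regex.Match(input, @"^(.*?)([0-9/+^-]+)=([0-9]+)");

Console.WriteLine(m.Groups[0].Value); // 3u->4+5=8
Console.WriteLine(m.Groups[1].Value); // 3u->
Console.WriteLine(m.Groups[2].Value); // 4+5
Console.WriteLine(m.Groups[3].Value); // 8
```

The lazy group first tries shorter prefixes ("", "3", "3u", "3u-"), and each of them fails because what follows cannot be matched as "class characters, then =". The first prefix that lets the rest of the pattern succeed is "3u->".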

Also, don't forget that - is a special character inside a character class. To match it literally, escape it or put it at the beginning or the end ([...-]); otherwise [+-^] becomes a range (from + to ^, which also covers =, > and the uppercase letters).
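The effect of the accidental range can be seen with a small sketch like the one below (top-level statements, so C# 9 or later is assumed; the variable names are made up for the illustration). It also shows that switching to a lazy quantifier alone is not enough while the class still contains the +-^ range.

```csharp
using System;
using System.Text.RegularExpressions;

// [+-^] is a range from '+' (U+002B) to '^' (U+005E), so '=' and '>' are accepted.
bool rangeAcceptsEquals = Regex.IsMatch("=", @"^[+-^]$");
bool rangeAcceptsGreater = Regex.IsMatch(">", @"^[+-^]$");

// With the hyphen last, the class holds only '+', '^' and '-'.
bool literalAcceptsEquals = Regex.IsMatch("=", @"^[+^-]$");
bool literalAcceptsHyphen = Regex.IsMatch("-", @"^[+^-]$");

Console.WriteLine(rangeAcceptsEquals);   // True
Console.WriteLine(rangeAcceptsGreater);  // True
Console.WriteLine(literalAcceptsEquals); // False
Console.WriteLine(literalAcceptsHyphen); // True

// Lazy quantifier combined with the original (range) class:
// the second group swallows "->" because both characters fall inside the range.
Match withRange = Regex.Match("3u->4+5=8", @"^(.*?)([0-9/+-^]+)=([0-9]+)");
Console.WriteLine(withRange.Groups[1].Value); // 3u
Console.WriteLine(withRange.Groups[2].Value); // ->4+5
```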


What's happening

Our regex (.*)([0-9/+-^]+), like any other regex, wants to return a match. In order to do that, it needs to find: "anything of any length, followed by at least one character from the [0-9/+-^] class".

Following only this rule, when applied to 3u->4+5 the regex could, at first glance, match:

  • 3u->4+ in the first group, 5 in the second (only one digit is required for the second group to match)
  • 3u->4 in the first group, +5 in the second
  • 3u-> in the first group, 4+5 in the second

So, which one should we match?

In order to know which one to pick, the (heuristic and simplified) rule is:

  • if the * quantifier is greedy it will always try to match the most it can
  • if it is lazy (so if you're using *?) it will match the least it can (while the regex still returns an overall match).
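The two rules can be compared side by side with a sketch like the following. Two things here are assumptions added for the illustration and are not part of the regex discussed above: the ^ and $ anchors, which force the two groups to cover the whole of 3u->4+5, and the hyphen moved to the end of the class so that it is a literal. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Left-hand side of the string from the question.
string text = "3u->4+5";

// Greedy: .* takes as much as it can and gives back one character at a time.
Match greedy = Regex.Match(text, @"^(.*)([0-9/+^-]+)$");

// Lazy: .*? takes as little as it can and grows only when the rest fails.
Match lazy = Regex.Match(text, @"^(.*?)([0-9/+^-]+)$");

Console.WriteLine(greedy.Groups[1].Value); // 3u->4+
Console.WriteLine(greedy.Groups[2].Value); // 5
Console.WriteLine(lazy.Groups[1].Value);   // 3u->
Console.WriteLine(lazy.Groups[2].Value);   // 4+5
```

The greedy version lands on the first split in the list above, the lazy version on the last one. Without the anchors, the lazy pattern would be satisfied even earlier, with an empty first group and just the leading 3 in the second.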

You can read more on the subject here or here, where the general underlying rules and subtleties are explained in more depth.

Upvotes: 3
