remo

Reputation: 3484

regex substring C#

I need help figuring out a regular expression.

I have

string input = "STATE changed from [Fixed] to [Closed], CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";

I need to match NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />] and capture the values CLOSED[] and TEST CLOSED in two string variables.
So far I have:

Regex NotesChanged = new Regex(@"NOTES changed from \[(\w*|\W*)\] to \[([\w-|\W-]*)\]");

which matches only if "NOTES changed from" starts at the beginning and there is no '[]' within '[ ]', but I have "[CLOSED[]]", and also only if there is no "<br />". Any ideas on what to change in the regex?

Thanks, Sharma

Upvotes: 0

Views: 2406

Answers (5)

John Leidegren

Reputation: 60997

This is kind of weird...

(\w*|\W*)

That is a capturing group that matches either word characters zero or more times, or non-word characters zero or more times, but never a mix of the two.
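To see why that group falls short, here is a small check. It is only a sketch: the class name AlternationDemo is made up for illustration, and the ^ and $ anchors are added just to make the test strict. Each alternative can consume only one kind of character, so a value that mixes letters with brackets or tags never fits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // The group from the question, anchored so the whole text must fit.
    // (The anchors are an addition for this demo, not part of the original.)
    public static bool Fits(string text) =>
        Regex.IsMatch(text, @"^\[(\w*|\W*)\]$");

    public static void Main()
    {
        Console.WriteLine(Fits("[Fixed]"));          // True: word characters only
        Console.WriteLine(Fits("[CLOSED[]<br />]")); // False: mixes both kinds
    }
}
```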

If you have matching brackets, what you want is a pattern that doesn't consume the closing delimiter.

\[([^\]]+)\]

That will match any occurrence of [with some text in it] where the matched text is the first group in the match.
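As a rough illustration of that pattern, the sketch below lists every bracketed segment it finds. The class and method names (BracketDemo, Segments) are hypothetical, and the input is a shortened version of the string from the question. Note how the nested brackets cut the third capture short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // Returns the text captured by group 1 of \[([^\]]+)\] for every match.
    public static List<string> Segments(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\[([^\]]+)\]"))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        var input = "STATE changed from [Fixed] to [Closed], "
                  + "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";
        foreach (var segment in Segments(input))
        {
            Console.WriteLine(segment);
        }
        // Prints:
        //   Fixed
        //   Closed
        //   CLOSED[
        //   TEST CLOSED <br />
    }
}
```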

Since you have the same type of delimiters nested within the string itself, it gets a bit trickier and you need to use lookahead or some sort of alternation.

((?:[^\[\]]|\[\])*)

This can be further improved, but there's a problem here that cannot be solved this way if you have [[[]]]: a pattern like this cannot recurse to arbitrary depth. So you need to either hard-code a maximum depth or apply the regular expression several times.

A fairly exhaustive way of doing this would be

\[((?:[^\[\]]*)(?:(?=\[)(?:[^\]]*)\])?([^\]]*))\]

Upvotes: 1

James

Reputation: 82096

If you have the luxury of fixing the regex with specific keywords or phrases, the following would work:

NOTES changed from (?:(?:\[)?([A-Z]+\[\]))<br />\] to \[([A-Z]+\s+[A-Z]+)

The above would match the string NOTES changed from [CLOSED[]<br />] to [TEST CLOSED and put CLOSED[] and TEST CLOSED into 2 separate groups.

Update

In fact you can make this even shorter (and a bit more non-specific) by using the . specifier:

NOTES changed from (?:(?:\[)?([A-Z]+\[\])).+\[([A-Z]+\s+[A-Z]+)

This means it will match like the above, only instead of being specific about matching the <br /> tags etc in between it will match regardless of what is in between.

Upvotes: 0

John McDonald

Reputation: 1799

If "<br />" is going to be there every time, you can use one of my favourite patterns (and it's worth memorizing). The pattern is:

delim[^delim]*delim

The pattern above will match a delimiter, followed by anything except the delimiter as many times as possible, then the delimiter again.

Here is the regular expression I would be tempted to use:

NOTES changed from \[([^<]*)[^\]]*\] to \[([^<]*)[^\]]*\]

In English:

  • Grabs the opening [
  • Capture #1 all characters until the < (assuming the br tag is always there)
  • Reads until the closing ]
  • Repeat for second capture zone
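Put into C#, the expression above might be used like this. This is a minimal sketch that assumes the <br /> tag is always present, as stated; the names NotesDemo and TryExtract are invented for the example, and the Trim call is an addition to drop the space that sits in front of the tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NotesDemo
{
    private static readonly Regex NotesChanged = new Regex(
        @"NOTES changed from \[([^<]*)[^\]]*\] to \[([^<]*)[^\]]*\]");

    // Fills oldNotes/newNotes from capture groups 1 and 2.
    // Returns false (and empty strings) when the input does not match.
    public static bool TryExtract(string input, out string oldNotes, out string newNotes)
    {
        Match m = NotesChanged.Match(input);
        oldNotes = m.Success ? m.Groups[1].Value.Trim() : "";
        newNotes = m.Success ? m.Groups[2].Value.Trim() : "";
        return m.Success;
    }

    public static void Main()
    {
        var input = "STATE changed from [Fixed] to [Closed], "
                  + "CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], "
                  + "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";

        if (TryExtract(input, out string oldNotes, out string newNotes))
        {
            Console.WriteLine(oldNotes); // CLOSED[]
            Console.WriteLine(newNotes); // TEST CLOSED
        }
    }
}
```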

Upvotes: 1

user197015

I believe you can use balancing group definitions to match the nested brackets. I believe these are .NET specific, at least in that particular implementation flavor. There's an example on that page, which I've adapted to your input here:

using System;
using System.Text.RegularExpressions;

class Program {
    static void Main (string[] args) {
        var input = "STATE changed from [Fixed] to [Closed], CLOSED DATE added [Fri Jan 14 09:32:19 MST 2011], NOTES changed from [CLOSED[]] to [TEST CLOSED ]";
        // Each '[' is pushed onto the 'open' stack; each ']' pops it again ('close-open').
        var regex = new Regex(@"NOTES changed from (((?'open'\[)[^\[\]]*)+((?'close-open'\])[^\[\]]*)+)*");

        // Print every match found in the input.
        foreach (Match match in regex.Matches(input)) {
            Console.WriteLine(match);
        }
    }
}

This prints NOTES changed from [CLOSED[]] to [TEST CLOSED ] for me. Note that in my adaptation I left off the bit of the expression that causes it to fail to match if the square brackets are not properly balanced, in order to reduce my example to the barest minimum that would satisfy your request... the expression is already pretty unpleasantly complex.

EDIT: Just saw your question got edited a bit while I was posting. The parts of the regex I've supplied here that match "anything but [ and ]" should be able to be replaced with capture groups for the substrings you need to extract.
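One possible way to follow that suggestion is sketched below. It is not the expression from the example above but a variant that wraps each bracketed value in a named capture group. The group names (from, to, open), the class name BalancedDemo, and the step that strips the <br /> tag are all assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedDemo
{
    // Body of one bracketed value whose inner [ ] pairs must balance.
    // An inner "[" pushes onto the "open" stack and an inner "]" pops it;
    // (?(open)(?!)) fails the match if anything is still on the stack.
    private const string Body =
        @"(?:[^\[\]]|(?<open>\[)|(?<-open>\]))*(?(open)(?!))";

    private static readonly Regex NotesChanged = new Regex(
        @"NOTES changed from \[(?<from>" + Body + @")\] to \[(?<to>" + Body + @")\]");

    // Returns true and the two cleaned values when the input matches.
    public static bool TryExtract(string input, out string oldNotes, out string newNotes)
    {
        Match m = NotesChanged.Match(input);
        oldNotes = m.Success ? Clean(m.Groups["from"].Value) : "";
        newNotes = m.Success ? Clean(m.Groups["to"].Value) : "";
        return m.Success;
    }

    // Assumption: the <br /> tag and surrounding spaces are not wanted.
    private static string Clean(string value) =>
        value.Replace("<br />", "").Trim();

    public static void Main()
    {
        var input = "NOTES changed from [CLOSED[]<br />] to [TEST CLOSED <br />]";
        if (TryExtract(input, out string oldNotes, out string newNotes))
        {
            Console.WriteLine(oldNotes); // CLOSED[]
            Console.WriteLine(newNotes); // TEST CLOSED
        }
    }
}
```

Unlike a fixed-depth pattern, this should also cope with deeper nesting such as [a[b[c]]], since the stack simply grows with each inner bracket.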

Upvotes: 0

Jordan Parmer

Reputation: 37174

Try adding "\[|\]" to your capture sequence in the bracket group.

Regex NotesChanged = new Regex(@"NOTES changed from \[(\w*|\W*|\[|\])\] to \[([\w-|\W-|\[|\]]*)\]");

Upvotes: 0
