user1186144

Reputation: 133

To find everything between { }

I'm new to regex and was hoping for a pointer towards matching words that are between { } brackets, where the first letter is uppercase and the second is lowercase. I want to ignore numbers, and also words that contain numbers. For example, given:

{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}

so I would only want to bring back matches for:

Test
Tesgd
Abc

I've looked at using \b and \w for words that are bounded and [A-Z][a-z] for an uppercase letter followed by a lowercase one, but I'm not sure how to get only the words that are between the brackets.

Upvotes: 0

Views: 267

Answers (3)

Alan Moore

Reputation: 75222

In answer to your original question, I would have offered this regex:

\b[A-Z][a-z]+\b(?=[^{}]*})

The last part is a positive lookahead; it notes the current match position, tries to match the enclosed subexpression, then returns the match position to where it started. In this case, it starts at the end of the word that was just matched and gobbles up as many characters as it can, as long as they're not { or }. If the next character after that is }, it means the word is inside a pair of braces, so the lookahead succeeds. If the next character is {, or if there's no next character because it's at the end of the string, the lookahead fails and the regex engine moves on to try the next word.
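As a quick check of the lookahead version, here is a minimal sketch that runs it against the sample string from the question. The class and method names (BraceWords, Find) are made up for illustration, and it assumes the braces are not nested.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BraceWords
{
    // A capitalized, letters-only word that is followed by a closing brace
    // with no other brace in between.
    private static readonly Regex InBraces =
        new Regex(@"\b[A-Z][a-z]+\b(?=[^{}]*})");

    // Hypothetical helper: returns every word the regex accepts.
    public static string[] Find(string input) =>
        InBraces.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(" ", Find(s)));               // Test Tesgd Abc
        Console.WriteLine(string.Join(" ", Find("Foo {Bar} Baz"))); // Bar
    }
}
```

In the second call, Foo fails because the first brace after it is {, and Baz fails because the string ends before any } is found.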

Unfortunately, that won't work because (as you mentioned in a comment) the braces may be nested. Matching any kind of nested or recursive structure is fundamentally incompatible with the way regexes work. Many regex flavors offer that capability anyway, but they tend to go about it in wildly different ways, and it's always ugly. Here's how I would do this in C#, using Balanced Groups:

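  using System;
  using System.Text.RegularExpressions;

  Regex r = new Regex(@"
      \b[A-Z][a-z]+\b        # a capitalized word made of letters only
      (?!                    # reject it if the rest of the string is balanced
        (?>
          [^{}]+             # anything that is not a brace
          |
          { (?<Open>)        # opening brace: push onto Open
          |
          } (?<-Open>)       # closing brace: pop from Open
        )*
        $
        (?(Open)(?!))        # fail if an opening brace is still unclosed
      )", RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
  string s = "testa Testb { Test1 Testc testd 1Test } Teste { Testf {testg Testh} testi } Testj";
  foreach (Match m in r.Matches(s))
  {
    Console.WriteLine(m.Value);
  }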

output:

Testc
Testf
Testh

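I'm still using a lookahead, but this time it's a negative one, and I'm using the group named Open as a counter to keep track of the number of opening braces relative to the number of closing braces. If the word currently under consideration is not enclosed in braces, the rest of the string is balanced: the subexpression consumes it all the way to the end of the string ($) with Open back at zero, so it matches and the negative lookahead rejects the word. If the word is enclosed, the subexpression eventually hits a } that has no matching {; (?<-Open>) can't pop from an empty group, so the subexpression never reaches $ and the word is accepted. The conditional construct - (?(Open)(?!)) - covers the remaining case, where Open is still nonzero at the end of the string: it interprets that as "true" and tries to match (?!). That's a negative lookahead for nothing, which is guaranteed to fail; it's always possible to match nothing.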
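The counting that Open does can also be written as an ordinary loop, which may be easier to maintain. The sketch below is a different technique from the balanced-group regex, not a restatement of it: it counts the braces to the left of each word instead of looking ahead, and it assumes the braces in the input are balanced. The names DepthScan and Find are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DepthScan
{
    // Same word test as in the regex answer: capitalized, letters only.
    private static readonly Regex Word = new Regex(@"\b[A-Z][a-z]+\b");

    // Hypothetical helper: returns the words found at brace depth 1 or deeper.
    public static List<string> Find(string input)
    {
        var result = new List<string>();
        int depth = 0;   // plays the role of the Open group
        int pos = 0;     // how far the brace count has been brought up to date
        foreach (Match m in Word.Matches(input))
        {
            // Account for every brace between the previous word and this one.
            for (; pos < m.Index; pos++)
            {
                if (input[pos] == '{') depth++;
                else if (input[pos] == '}' && depth > 0) depth--;
            }
            if (depth > 0) result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string s = "testa Testb { Test1 Testc testd 1Test } Teste { Testf {testg Testh} testi } Testj";
        Console.WriteLine(string.Join(" ", Find(s)));   // Testc Testf Testh
    }
}
```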

Nested or not, there's no need to use a lookbehind; a lookahead is sufficient. Most flavors place such severe restrictions on lookbehinds that nobody would even think to try using them for a job like this. .NET has no such restrictions, so you could do this in a lookbehind, but it wouldn't make much sense. Why do all that work when the other conditions--uppercase first letter, no digits, etc--are so much cheaper to test?

Upvotes: 0

Ali Ferhat

Reputation: 2579

Here is your solution:

Regex r = new Regex(@"(?<={[^}]*?({(?<depth>)[^}]*?}(?<-depth>))*?[^}]*?)(?<myword>[A-Z][a-z]+?)(?=,|}|\Z)", RegexOptions.ExplicitCapture);
string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
var m = r.Matches(s);
foreach (Match match in m)
   Console.WriteLine(match.Groups["myword"].Value);

I assumed it is OK to match words that are inside braces but not at the deepest nesting level. Let's dissect the regex a bit. AAA means an arbitrary expression; www means an arbitrary identifier (a sequence of letters).

  • . is any character (except a newline, by default)
  • [A-Z] is as you can guess any upper case letter.
  • [^}] is any character but }
  • ,|}|\Z means , or } or end-of-string
  • *? means match what came before 0 or more times, but lazily (match as little as possible, and take more only when the rest of the pattern cannot match otherwise)
  • (?<=AAA) means AAA should match on the left before you really try to match something.
  • (?=AAA) means AAA should match on the right after you really match something.
  • (?<www>AAA) means match AAA and give the string you matched the name www. With the ExplicitCapture option, these named groups are the only ones that capture.
  • (?<depth>) matches the empty string (so it always succeeds) and pushes a capture onto the "depth" stack.
  • (?<-depth>) matches the empty string and pops a capture from the "depth" stack. Fails if the stack is empty.

We use the last two items to ensure that we are inside a pair of braces. It would be much simpler if there were no nested braces, or if matches occurred only in the deepest pair.
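To see the push and pop constructs in isolation, here is a small sketch that only checks whether the braces in a string are balanced. It is not part of the solution above; the pattern and the names BalanceCheck and IsBalanced are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalanceCheck
{
    // {(?<depth>)    pushes one capture for every opening brace
    // }(?<-depth>)   pops one for every closing brace, failing on an empty stack
    // (?(depth)(?!)) fails at the end if anything is still on the stack
    private static readonly Regex Balanced = new Regex(
        @"^(?>[^{}]+|{(?<depth>)|}(?<-depth>))*(?(depth)(?!))$");

    public static bool IsBalanced(string input) => Balanced.IsMatch(input);

    public static void Main()
    {
        Console.WriteLine(IsBalanced("a {b {c} d} e"));  // True
        Console.WriteLine(IsBalanced("a {b {c d} e"));   // False: one { is never closed
        Console.WriteLine(IsBalanced("a } b { c"));      // False: } arrives on an empty stack
    }
}
```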

The regular expression works on your example and probably has no bugs. However, I tend to agree with the others: you should not blindly copy what you cannot understand and maintain. Regular expressions are wonderful, but only if you are willing to spend the effort to learn them.

Edit: I corrected a careless mistake in the regex (replaced .*? with [^}]*? in two places). Moral of the story: it's very easy to introduce bugs in regexes.

Upvotes: 3

Adam Mihalcin

Reputation: 14458

Do the filtering in two steps. Use the regular expression

@"\{(.*)\}"

to pull out the pieces between the brackets, and the regular expression

@"\b([A-Z][a-z]+)\b"

to pull out each of the words that begins with a capital letter and is followed by lower case letters.
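A minimal sketch of the two-step idea, assuming the braces are not nested. It uses [^{}]* in the first pattern so that each brace pair is captured separately; a greedy .* would run from the first { to the last } and take in the text between the pairs as well. The names TwoStep and Find are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStep
{
    // Step 1: the contents of one (un-nested) brace pair.
    private static readonly Regex Group = new Regex(@"\{([^{}]*)\}");
    // Step 2: capitalized, letters-only words inside those contents.
    private static readonly Regex Word = new Regex(@"\b([A-Z][a-z]+)\b");

    // Hypothetical helper: applies step 2 to every piece found by step 1.
    public static List<string> Find(string input)
    {
        var result = new List<string>();
        foreach (Match g in Group.Matches(input))
        {
            foreach (Match w in Word.Matches(g.Groups[1].Value))
            {
                result.Add(w.Groups[1].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(" ", Find(s)));   // Test Tesgd Abc
    }
}
```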

Upvotes: -1
