podeig
podeig

Reputation: 2741

Regex for nested patterns

I have such text:

((#) This is text

    ((#) This is subtext 

        ((#) This is sub-subtext #)

    #)

 #)

I made following regex:

        var counter = 0;
        return Regex.Replace(text,
             @"\(\(#\)(.*?)#\)",
             m =>
             {
                var str = m.ToString();
                counter++;
                return counter + ") " + str.Replace("((#)", "").Replace("#)", "")
             });

So the result I expected would be like

1) This is text
   2) This is subtext
       3) This is sub-subtext

I know that this will not work properly, because regex will take #) from the second ((#) and so on.

How to avoid this conflict? Thanks! :)

Upvotes: 2

Views: 2108

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627119

Here is the solution I suggest:

  • Get the nested strings with the regex featuring balanced groups,
  • Replace the substrings in a loop.

See the regex demo here. It matches empty strings but also captures all nested substrings that start with ((#) and end with #).

Here is C# demo code:

var text = @"((#) This is text

    ((#) This is subtext 

        ((#) This is sub-subtext #)

     #)

#)";
var chunks = Regex.Matches(text,
            @"(?s)(?=(\(\(#\)(?>(?!\(\(#\)|#\)).|\(\(#\)(?<D>)|#\)(?<-D>))*(?(D)(?!))#\)))")
               .Cast<Match>().Select(p => p.Groups[1].Value)
               .ToList();
for (var i = 0; i < chunks.Count; i++)
     text = text.Replace(chunks[i], string.Format("{0}) {1}", (i+1), 
                         chunks[i].Substring(4, chunks[i].Length-6).Trim()));

Note that .Substring(4, chunks[i].Length-6) just gets a substring from ((#) up to #). Since we know the delimiters, we can hardcode these values.

Output:

enter image description here

To learn more about balancing groups, see Balancing Groups Definition and Fun With .NET Regex Balancing Groups.

Upvotes: 1

F.P
F.P

Reputation: 17831

I believe this to be impossible, because your grammar is at its core recursive:

TEXT := "((#)" TEXT "#)"

Which is something that cannot be consumed by a regular expression, because it can only handle languages created by regular grammar.

In that sense, the question linked by Ondrej actually does answer your problem, just not how you want it.

The only way you can handle this with regular expressions is by limiting yourself to a definitive depth of recursion and match everything up to this depth, which I think is not what you want.

To make this work for any number of nesting levels, you will have no other choice (that I know of) than using a parser for context-free languages.

Upvotes: 0

Related Questions