ChevCast
ChevCast

Reputation: 59231

Regular expression can't handle rogue square brackets

Thanks to the smarties on here in the past I have this amazing recursive regular expression that helps me to transform custom BBCode-style tags in a block of text.

/// <summary>
/// Static class containing common regular expression strings.
/// </summary>
public static class RegularExpressions
{
    /// <summary>
    /// Expression to find all root-level BBCode tags. Use this expression recursively to obtain nested tags.
    /// </summary>
    public static string BBCodeTags
    {
        get
        {
            return @"
                    (?>
                      \[ (?<tag>[^][/=\s]+) \s*
                      (?: = \s* (?<val>[^][]*) \s*)?
                      ]
                    )

                    (?<content>
                      (?>
                        \[(?<innertag>[^][/=\s]+)[^][]*]
                        |
                        \[/(?<-innertag>\k<innertag>)]
                        |
                        [^][]+
                      )*
                      (?(innertag)(?!))
                    )

                    \[/\k<tag>]
                    ";
        }
    }
}

This regex works beautifully, recursively matching on all tags. Like this:

[code]  
    some code  
    [b]some text [url=http://www.google.com]some link[/url][/b]  
[/code]

The regex does exactly what I want and matches the [code] tag. It breaks it up into three groups: tag, optional value, and content. Tag being the tag name ("code" in this case). Optional value being a value after the equals(=) sign if there is one. And content being everything between the opening and closing tag.

The regex can be used recursively to match nested tags. So after matching on [code] I would run it again against the content group and it would match the [b] tag. If I ran it again on the next content group it would then match the [url] tag.

All of that is wonderful and delicious but it hiccups on one issue. It can't handle rogue square brackets.

[code]This is a successful match.[/code]

[code]This is an [ unsuccessful match.[/code]

[code]This is also an [unsuccessful] match.[/code]

I really suck at regular expressions but if anyone knows how I might tweak this regex to correctly ignore rogue brackets (brackets that do not make up an opening tag and/or do not have a matching closing tag) so that it still matches the surrounding tags, I would be very appreciative :D

Thanks in advance!

Edit

If you are interested in seeing the method where I use this expression you are welcome to.

Upvotes: 3

Views: 745

Answers (3)

Miguel Angelo
Miguel Angelo

Reputation: 24202

I did a program that can parse your strings in a debugable, developer-friendly way. It is not a small code like those regexes, but it has a positive side: you can debug it, and fine tune it as you need.

The implementation is a descent recursive parser, but if you need some kind of contextual data, you can place it all inside the ParseContext class.

It is quite long, but I consider it as being better than a a regex based solution.

To test it, create a console application, and replace all the code inside Program.cs with the following code:

using System.Collections.Generic;
namespace q7922337
{
    static class Program
    {
        static void Main(string[] args)
        {
            var result1 = Match.ParseList<TagsGroup>("[code]This is a successful match.[/code]");
            var result2 = Match.ParseList<TagsGroup>("[code]This is an [ unsuccessful match.[/code]");
            var result3 = Match.ParseList<TagsGroup>("[code]This is also an [unsuccessful] match.[/code]");
            var result4 = Match.ParseList<TagsGroup>(@"
                        [code]  
                            some code  
                            [b]some text [url=http://www.google.com]some link[/url][/b]  
                        [/code]");
        }
        class ParseContext
        {
            public string Source { get; set; }
            public int Position { get; set; }
        }
        abstract class Match
        {
            public override string ToString()
            {
                return this.Text;
            }
            public string Source { get; set; }
            public int Start { get; set; }
            public int Length { get; set; }
            public string Text { get { return this.Source.Substring(this.Start, this.Length); } }
            protected abstract bool ParseInternal(ParseContext context);
            public bool Parse(ParseContext context)
            {
                var result = this.ParseInternal(context);
                this.Length = context.Position - this.Start;
                return result;
            }
            public bool MarkBeginAndParse(ParseContext context)
            {
                this.Start = context.Position;
                var result = this.ParseInternal(context);
                this.Length = context.Position - this.Start;
                return result;
            }
            public static List<T> ParseList<T>(string source)
                where T : Match, new()
            {
                var context = new ParseContext
                {
                    Position = 0,
                    Source = source
                };
                var result = new List<T>();
                while (true)
                {
                    var item = new T { Source = source, Start = context.Position };
                    if (!item.Parse(context))
                        break;
                    result.Add(item);
                }
                return result;
            }
            public static T ParseSingle<T>(string source)
                where T : Match, new()
            {
                var context = new ParseContext
                {
                    Position = 0,
                    Source = source
                };
                var result = new T { Source = source, Start = context.Position };
                if (result.Parse(context))
                    return result;
                return null;
            }
            protected List<T> ReadList<T>(ParseContext context)
                where T : Match, new()
            {
                var result = new List<T>();
                while (true)
                {
                    var item = new T { Source = this.Source, Start = context.Position };
                    if (!item.Parse(context))
                        break;
                    result.Add(item);
                }
                return result;
            }
            protected T ReadSingle<T>(ParseContext context)
                where T : Match, new()
            {
                var result = new T { Source = this.Source, Start = context.Position };
                if (result.Parse(context))
                    return result;
                return null;
            }
            protected int ReadSpaces(ParseContext context)
            {
                int startPos = context.Position;
                int cnt = 0;
                while (true)
                {
                    if (startPos + cnt >= context.Source.Length)
                        break;
                    if (!char.IsWhiteSpace(context.Source[context.Position + cnt]))
                        break;
                    cnt++;
                }
                context.Position = startPos + cnt;
                return cnt;
            }
            protected bool ReadChar(ParseContext context, char p)
            {
                int startPos = context.Position;
                if (startPos >= context.Source.Length)
                    return false;
                if (context.Source[startPos] == p)
                {
                    context.Position = startPos + 1;
                    return true;
                }
                return false;
            }
        }
        class Tag : Match
        {
            protected override bool ParseInternal(ParseContext context)
            {
                int startPos = context.Position;
                if (!this.ReadChar(context, '['))
                    return false;
                this.ReadSpaces(context);
                if (this.ReadChar(context, '/'))
                    this.IsEndTag = true;
                this.ReadSpaces(context);
                var validName = this.ReadValidName(context);
                if (validName != null)
                    this.Name = validName;
                this.ReadSpaces(context);
                if (this.ReadChar(context, ']'))
                    return true;
                context.Position = startPos;
                return false;
            }
            protected string ReadValidName(ParseContext context)
            {
                int startPos = context.Position;
                int endPos = startPos;
                while (char.IsLetter(context.Source[endPos]))
                    endPos++;
                if (endPos == startPos) return null;
                context.Position = endPos;
                return context.Source.Substring(startPos, endPos - startPos);
            }
            public bool IsEndTag { get; set; }
            public string Name { get; set; }
        }
        class TagsGroup : Match
        {
            public TagsGroup()
            {
            }
            protected TagsGroup(Tag openTag)
            {
                this.Start = openTag.Start;
                this.Source = openTag.Source;
                this.OpenTag = openTag;
            }
            protected override bool ParseInternal(ParseContext context)
            {
                var startPos = context.Position;
                if (this.OpenTag == null)
                {
                    this.ReadSpaces(context);
                    this.OpenTag = this.ReadSingle<Tag>(context);
                }
                if (this.OpenTag != null)
                {
                    int textStart = context.Position;
                    int textLength = 0;
                    while (true)
                    {
                        Tag tag = new Tag { Source = this.Source, Start = context.Position };
                        while (!tag.MarkBeginAndParse(context))
                        {
                            if (context.Position >= context.Source.Length)
                            {
                                context.Position = startPos;
                                return false;
                            }
                            context.Position++;
                            textLength++;
                        }
                        if (!tag.IsEndTag)
                        {
                            var tagGrpStart = context.Position;
                            var tagGrup = new TagsGroup(tag);
                            if (tagGrup.Parse(context))
                            {
                                if (textLength > 0)
                                {
                                    if (this.Contents == null) this.Contents = new List<Match>();
                                    this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
                                    textStart = context.Position;
                                    textLength = 0;
                                }
                                this.Contents.Add(tagGrup);
                            }
                            else
                            {
                                textLength += tag.Length;
                            }
                        }
                        else
                        {
                            if (tag.Name == this.OpenTag.Name)
                            {
                                if (textLength > 0)
                                {
                                    if (this.Contents == null) this.Contents = new List<Match>();
                                    this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
                                    textStart = context.Position;
                                    textLength = 0;
                                }
                                this.CloseTag = tag;
                                return true;
                            }
                            else
                            {
                                textLength += tag.Length;
                            }
                        }
                    }
                }
                context.Position = startPos;
                return false;
            }
            public Tag OpenTag { get; set; }
            public Tag CloseTag { get; set; }
            public List<Match> Contents { get; set; }
        }
        class Text : Match
        {
            protected override bool ParseInternal(ParseContext context)
            {
                return true;
            }
        }
    }
}

If you use this code, and someday find that you need optimizations because the parser has become ambiguous, then try using a dictionary in the ParseContext, take a look here for more info: http://en.wikipedia.org/wiki/Top-down_parsing in the topic Time and space complexity of top-down parsing. I find it very interesting.

Upvotes: 3

Kobi
Kobi

Reputation: 138097

Ok. Here's another attempt. This one is a little more complicated.
The idea is to match the whole text from start to ext, and parse it to a single Match. While rarely used as such, .Net Balancing Groups allow you to fine tune your captures, remembering all positions and captures exactly the way you want them.
The pattern I came up with is:

\A
(?<StartContentPosition>)
(?:
    # Open tag
    (?<Content-StartContentPosition>)          # capture the content between tags
    (?<StartTagPosition>)                      # Keep the starting postion of the tag
    (?>\[(?<TagName>[^][/=\s]+)[^\]\[]*\])     # opening tag
    (?<StartContentPosition>)                  # start another content capture
    |
    # Close tag
    (?<Content-StartContentPosition>)          # capture the content in the tag
    \[/\k<TagName>\](?<Tag-StartTagPosition>)  # closing tag, keep the content in the <tag> group
    (?<-TagName>)
    (?<StartContentPosition>)                  # start another content capture
    |
    .           # just match anything. The tags are first, so it should match
                # a few if it can. (?(TagName)(?!)) keeps this in line, so
                # unmatched tags will not mess with the resul
)*
(?<Content-StartContentPosition>)          # capture the content after the last tag
\Z
(?(TagName)(?!))

Remember - the balancing group (?<A-B>) captures into A all text since B was last captured (and pops that position from B's stack).

Now you can parse the string using:

Match match = Regex.Match(sample, pattern, RegexOptions.Singleline |
                                           RegexOptions.IgnorePatternWhitespace);

Your interesting data will be on match.Groups["Tag"].Captures, which contains all tags (some of them are contained in others), and match.Groups["Content"].Captures, which contains tag's contents, and contents between tags. For example, without all blanks, it contains:

  • some code
  • some text
  • This is also an successful match.
  • This is also an [ unsuccessful match.
  • This is also an [unsuccessful] match.

This is pretty close to a full parsed document, but you'll still have to play with indices and length to figure out the exact order and structure of the document (though it isn't more complex than sorting all captures)

At this point I'll state what others have said - it may be a good time to write a parser, this pattern isn't pretty...

Upvotes: 0

Kobi
Kobi

Reputation: 138097

The first change is pretty simple - you can get it by changing [^][]+, which is responsible for matching the free text, to .. This seems a little crazy, perhaps, but it's actually safe, because you are using a possessive group (?> ), so all the valid tags will be matched by the first alternation - \[(?<innertag>[^][/=\s]+)[^][]*] - and cannot backtrack and break the tags.
(Remember to enable the Singleline flag, so . matches newlines)

The second requirement, [unsuccessful], seems to go against your goal it. The whole idea from the very start is not to match these unclosed tags. If you allow unclosed tags, all matches of the form \[(.*?)\].*?[/\1] become valid. Not good. At best, you can try to whitelist a few tags which are not allowed to be matched.

An example of both changes:

(?>
\[ (?<tag>[^][/=\s]+) \s*
(?: = \s* (?<val>[^][]*) \s*)?
\]
)
  (?<content>
    (?>
       \[(?:unsuccessful)\]  # self closing
       |
       \[(?<innertag>[^][/=\s]+)[^][]*]
       |
       \[/(?<-innertag>\k<innertag>)]
       |
       .
    )*
    (?(innertag)(?!))
  )
\[/\k<tag>\]

Working example on Regex Hero

Upvotes: 1

Related Questions