Matthew Goulart

Reputation: 3065

Using regex to parse nested IF statements

I am working on my own templating engine as a learning and hobby project. I have a regex that looks for if statements using a syntax almost identical to Twig.

You can view the regex here with a few working examples and then one I am trying to make work.

Here is the regex:

{%\s*if\s+(?<var>(?:[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)(?:\.(?:[a-zA-Z0-9_\x7f-\xff]*))*)\s(?:(?<operation>=|!=|<=|<|>=|>)\s(?<var2>(?:(?:(?1)*)(?:\.(?:(?2)*))*)|(?:[0-9]+))\s?)?%}(?<if>(?:(?!{% e(?:lse|nd if) %}).)*)(?:{%\h?else\h?%}(?<else>[\s\S]*?))?{%\h?end if\s?%}

And here is the data it's processing:

THESE WORK
        {% if thing %}
        stuff
        {% end if %}

        {% if thing %}
        stuff
        {% else %}
        other stuff
        {% end if %}

        {%if thing = thingy %}
        stuff
        {% else %}
        other stuff
        {% end if %}
THIS DOESN'T
        Problem starts here:
        {% if this = that %}
        {% if item.currency = 0 %}
        selected="selected"
        {% else %}
        you
        {% end if %}
        {% end if %}

Basically I would like the regex to search for the last {% end if %} tag and use everything in between as a string that I can recursively parse later.

Also, as a side note, is it appropriate to leave most of the question's information in a link to regex tester? Or should I also copy the bulk of the question in here (on SO)?

Upvotes: 2

Views: 1675

Answers (1)

user557597


Rev 1

Since you are experimenting, I fooled around a bit and came up with a general regex for you.

This might add to your current knowledge, and it gives something to build on.

Synopsis:

In a pure regex solution, matching balanced text is about as far as
a regex engine goes. It won't fill in the details.
For that, you have to do the work yourself.

This is a slow way of doing it compared to a descent parser or similar.
The difference is that it does not need to unwind to know where it is,
so it can continue parsing when it encounters errors.
That lets you pull more meaning from whatever follows an error, which helps when debugging.

In doing this kind of thing, you should parse every single character.
So we parse the content, the delimiter begin, the core, the end, and errors.

In this case, we set aside 7 capture groups on the outer scope to skim information.

Content - This is composed of anything other than if/else/end if.

Else - This is the else statement

Begin - This is the beginning if block

If_Content - This is the if block content

Core - This is all between the outer begin and end. Contains nested content as well.

End - This is the outer end if block

Error - This is the unbalanced error; it is either an if or an end if.
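To make the "parse every single character" idea concrete, here is a minimal Python sketch. It is not the regex from this answer: Python's stdlib `re` has no `(?&name)` subroutine calls or `(?(DEFINE)...)`, so this only separates keyword tags from content and leaves nesting for the host code to handle. The names `KEYWORD` and `tokenize` are made up for illustration.

```python
import re

# Hypothetical keyword pattern (an assumption, not the answer's regex):
# matches an if tag, an end-if tag, or an else tag.
KEYWORD = re.compile(
    r"\{%\s*(?:if\s+(?P<cond>(?:(?!%\}).)+?)"
    r"|(?P<end_tag>end\s+if)"
    r"|(?P<else_tag>else))\s*%\}",
    re.S,
)


def tokenize(text):
    """Yield (kind, value) pairs that together cover every character of text."""
    pos = 0
    for m in KEYWORD.finditer(text):
        if m.start() > pos:
            # Everything between two keywords is plain content.
            yield ("content", text[pos:m.start()])
        if m.group("cond") is not None:
            yield ("if", m.group("cond"))
        elif m.group("end_tag") is not None:
            yield ("end", m.group(0))
        else:
            yield ("else", m.group(0))
        pos = m.end()
    if pos < len(text):
        yield ("content", text[pos:])
```

Calling `list(tokenize(template))` gives a flat stream of content, if, else and end tokens. Whether an `if` is balanced is not decided here; that is the job of the Core group in the regex, or of the host code in this sketch.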

Usage:

In the host program, define a function called ParseCore()
This function needs to be passed (or know) the current core string.
If it were c++, it would be passed begin and end string iterators.
Anyway, the string has to be local to the function.

In this function, sit in a while loop parsing the string.
On each match, use an if/else chain to see which group(s) from above matched.
Only these combinations are possible:

Content
or
Else
or
Begin, If_Content, Core, End
or
Error

Only one group is of importance for recursion. That is the Core group.
When this group matches, you make a recursive function call to
ParseCore() passing the Core string to it.

This repeats until no more matches.
Error reporting, building a tree of the structure, and anything else can be done
within this function.
You could even set a global flag at any point to unwind the recursive calls
and exit, for example if you want to STOP on error.

NOTE: On the initial call to ParseCore(), just pass in the entire original string to kick off the parse.
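The ParseCore() flow described above can be sketched in runnable Python. This is a rough approximation under stated assumptions: stdlib `re` cannot recurse, so a depth counter (`find_end`, a hypothetical helper) stands in for the Core group, and the "only one else per core" guard is left out. The names `TAG`, `find_end` and `parse_core` are illustrative only.

```python
import re

# Hypothetical tag pattern (assumption): if / end if / else.
TAG = re.compile(
    r"\{%\s*(?:if\s+(?P<cond>(?:(?!%\}).)+?)"
    r"|(?P<end_tag>end\s+if)"
    r"|(?P<else_tag>else))\s*%\}",
    re.S,
)


def find_end(text, start):
    """Return the 'end if' match balancing an 'if' tag that ended at start, or None."""
    depth = 1
    for m in TAG.finditer(text, start):
        if m.group("cond") is not None:
            depth += 1
        elif m.group("end_tag") is not None:
            depth -= 1
            if depth == 0:
                return m
    return None


def parse_core(core, level=0, errors=None):
    """Return (nodes, errors); errors are recorded and parsing continues past them."""
    errors = [] if errors is None else errors
    nodes, pos = [], 0
    while True:
        m = TAG.search(core, pos)
        if m is None:
            break
        if m.start() > pos:
            nodes.append(("content", core[pos:m.start()]))
        pos = m.end()
        if m.group("cond") is not None:
            end = find_end(core, m.end())
            if end is None:
                errors.append("unbalanced " + m.group(0))
            else:
                # Recurse on the core between the balanced begin and end tags.
                inner = core[m.end():end.start()]
                children, _ = parse_core(inner, level + 1, errors)
                nodes.append(("if", m.group("cond"), children))
                pos = end.end()
        elif m.group("else_tag") is not None:
            if level == 0:
                errors.append("'else' not in block " + m.group(0))
            else:
                nodes.append(("else",))
        else:
            # An 'end if' with no matching 'if' in this core.
            errors.append("unbalanced " + m.group(0))
    if pos < len(core):
        nodes.append(("content", core[pos:]))
    return nodes, errors
```

Kick it off with the whole template, e.g. `nodes, errors = parse_core(template)`. A stray tag is reported in `errors` while the rest of the input is still parsed, which is the "continue past errors" behavior described in the synopsis.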

Good luck!

Expanded

 # (?s)(?:(?<Content>(?&_content))|(?<Else>(?&_else))|(?<Begin>{%\s*if\s+(?<If_Content>(?&_ifbody))\s*%})(?<Core>(?&_core)|)(?<End>{%\s*end\s+if\s*%})|(?<Error>(?&_keyword)))(?(DEFINE)(?<_ifbody>(?>(?!%}).)+)(?<_core>(?>(?<_content>(?>(?!(?&_keyword)).)+)|(?(<_else>)(?!))(?<_else>(?>{%\s*else\s*%}))|(?>{%\s*if\s+(?&_ifbody)\s*%})(?:(?=.)(?&_core)|){%\s*end\s+if\s*%})+)(?<_keyword>(?>{%\s*(?:if\s+(?&_ifbody)|end\s+if|else)\s*%})))

 (?s)                               # Dot-all modifier

 # =====================
 # Outer Scope
 # ---------------

 (?:
      (?<Content>                        # (1), Non-keyword CONTENT
           (?&_content) 
      )
   |                                   # OR,
      # --------------
      (?<Else>                           # (2), ELSE
           (?&_else) 
      )
   |                                   # OR
      # --------------
      (?<Begin>                          # (3), IF
           {% \s* if \s+ 
           (?<If_Content>                     # (4), if content
                (?&_ifbody) 
           )
           \s* %}
      )
      (?<Core>                           # (5), The CORE
           (?&_core) 
        |  
      )
      (?<End>                            # (6)
           {% \s* end \s+ if \s* %}           # END IF
      )
   |                                   # OR
      # --------------
      (?<Error>                          # (7), Unbalanced IF or END IF
           (?&_keyword) 
      )
 )

 # =====================
 #  Subroutines
 # ---------------

 (?(DEFINE)

      # __ If Body ----------------------
      (?<_ifbody>                        # (8)
           (?>
                (?! %} )
                . 
           )+
      )

      # __ Core -------------------------
      (?<_core>                          # (9)
           (?>
                #
                # __ Content ( non-keywords )
                (?<_content>                       # (10)
                     (?>
                          (?! (?&_keyword) )
                          . 
                     )+
                )
             |  
                #
                # __ Else
                # Guard:  Only 1 'else'
                # allowed in this core !!

                (?(<_else>)
                     (?!)
                )
                (?<_else>                          # (11)
                     (?> {% \s* else \s* %} )
                )
             |  
                #
                # IF  (block start)
                (?>
                     {% \s* if \s+ 
                     (?&_ifbody) 
                     \s* %}
                )
                # Recurse core
                (?:
                     (?= . )
                     (?&_core) 
                  |  
                )
                # END IF  (block end)
                {% \s* end \s+ if \s* %}
           )+
      )

      # __ Keyword ----------------------
      (?<_keyword>                       # (12)
           (?>

                {% \s* 
                (?:
                     if \s+ (?&_ifbody) 
                  |  end \s+ if
                  |  else
                )
                \s* %}
           )
      )
 )

Sample input (removed)
Selected Output (removed)

Pseudo Code Usage Example

bool bStopOnError = false;
regex RxCore(".....");

bool ParseCore( string sCore, int nLevel )
{
    // Locals
    bool bFoundError = false; 
    bool bBeforeElse = true;
    match _matcher;

    while ( search ( sCore, RxCore, _matcher ) )
    {
      // Content
        if ( _matcher["Content"].matched == true )
        {
          // Print non-keyword content
          print ( _matcher["Content"].str() );

          // OR, Analyze content.
          // If this 'content' has errors and you wish to return:
          // if ( bStopOnError )
          //   bFoundError = true;
        }
        else
      // Else 
        if ( _matcher["Else"].matched == true )
        {
            // Check if we are not in a recursion
            if ( nLevel <= 0 )
            {
               // Report error, this 'else' is outside an 'if/end if' block
               // ( note - will only occur when nLevel == 0 )
               print ("\n>> Error, 'else' not in block " + _matcher["Else"].str() + "\n");

               // If this 'else' error will stop the process.
               if ( bStopOnError == true )
                  bFoundError = true;
            }
            else
            {
                // Here, we are inside a core recursion.
                // That means there can only be 1 'else'.
                // Print 'else'.
                print ( _matcher["Else"].str() );

                // Set the state of 'else'. 
                bBeforeElse = false;
            }
        }

        else
      // Error ( will only occur when nLevel == 0 )
        if ( _matcher["Error"].matched == true )
        {
            // Report error 
            print ("\n>> Error, unbalanced " + _matcher["Error"].str() + "\n");
            // If this unbalanced 'if/end if' error will stop the process.
            if ( bStopOnError == true )
                bFoundError = true;
        }

        else
      // IF/END IF block
        if ( _matcher["Begin"].matched == true )
        {
            // Analyze 'if content' for error and wish to return.
            string sIfContent = _matcher["If_Content"].str();
            // if ( bStopOnError )
            //   bFoundError = true;
            // else
            // {            
                 // Print 'begin' ( includes 'if content' )
                 print ( _matcher["Begin"].str() );

                 //////////////////////////////
                 // Recurse a new 'core'
                 bool bResult = ParseCore( _matcher["Core"].str(), nLevel+1 );
                 //////////////////////////////

                 // Check recursion result. See if we should unwind.
                 if ( bResult == false && bStopOnError == true )
                     bFoundError = true;
                 else
                     // Print 'end'
                     print ( _matcher["End"].str() );
            // }
        }
        else 
        {
           // Reserved placeholder, won't get here at this time.
        }

      // Error-Return Check
         if ( bFoundError == true && bStopOnError == true )
             return false;
    }

    // Finished this core!! Return true.
    return true;
}

///////////////////////////////
// Main

string strInitial = "...";

bool bResult = ParseCore( strInitial, 0 );
if ( bResult == false )
   print ( "Parse terminated abnormally, check messages!\n" );

Upvotes: 4
