Reputation: 3065
I am working on my own templating engine as a learning exercise and hobby project. I have a regex that looks for if statements using a syntax almost identical to Twig.
You can view the regex here, along with a few working examples and one that I am trying to make work.
Here is the regex:
{%\s*if\s+(?<var>(?:[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)(?:\.(?:[a-zA-Z0-9_\x7f-\xff]*))*)\s(?:(?<operation>=|!=|<=|<|>=|>)\s(?<var2>(?:(?:(?1)*)(?:\.(?:(?2)*))*)|(?:[0-9]+))\s?)?%}(?<if>(?:(?!{% e(?:lse|nd if) %}).)*)(?:{%\h?else\h?%}(?<else>[\s\S]*?))?{%\h?end if\s?%}
And here is the data it's processing:
THESE WORK
{% if thing %}
stuff
{% end if %}
{% if thing %}
stuff
{% else %}
other stuff
{% end if %}
{%if thing = thingy %}
stuff
{% else %}
other stuff
{% end if %}
THIS DOESN'T
Problem starts here:
{% if this = that %}
{% if item.currency = 0 %}
selected="selected"
{% else %}
you
{% end if %}
{% end if %}
Basically I would like the regex to search for the last {% end if %} tag and use everything in between as a string that I can recursively parse later.
Also, as a side note, is it appropriate to leave most of the question's information in a link to regex tester? Or should I also copy the bulk of the question in here (on SO)?
Upvotes: 2
Views: 1675
Since you are experimenting, I fooled around a bit and came up with a general regex for you.
This might add to your current knowledge, and it gives you something to build on.
Synopsis:
In a pure regex solution, matching balanced text is about as far as
a regex engine will go. It won't fill in the details.
For that you have to do the work yourself.
This is a slow way of doing it compared to a recursive descent parser or similar.
The difference is that it does not need to unwind to know where it is,
so it can continue parsing when it encounters errors.
That lets you get more meaning from the text past an error, which helps with debugging.
When doing this kind of thing, you should parse every single character.
So we parse the content, the beginning delimiter, the core, the end, and errors.
In this case, we set aside 7 capture groups on the outer scope to skim information.
Content - Anything other than if/else/end if.
Else - The else statement.
Begin - The beginning if tag of a block.
If_Content - The content of the if tag, that is, the text after the if keyword up to the closing %}.
Core - Everything between the outer begin and end. Contains nested content as well.
End - The outer end if tag of a block.
Error - The unbalanced error; it is either an if or an end if.
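As a rough illustration of the "parse every character" idea, here is a minimal Python sketch. It is not the regex from this answer: Python's built-in re module has no subroutine calls or DEFINE blocks, so the sketch only finds the three keywords and treats the gaps between them as content. The names KEYWORD and tokenize are made up for this example.

```python
import re

# Assumed keyword pattern: {% if ... %}, {% else %} or {% end if %}.
KEYWORD = re.compile(
    r"\{%\s*(?:(?P<end>end\s+if)|(?P<else>else)|if\s+(?P<cond>(?:(?!%\}).)+?))\s*%\}",
    re.S,
)

def tokenize(text):
    """Yield (kind, value) pairs that together cover every character of text."""
    pos = 0
    for m in KEYWORD.finditer(text):
        if m.start() > pos:
            # Everything between two keywords is plain content.
            yield ("content", text[pos:m.start()])
        if m.group("end"):
            yield ("end", m.group(0))
        elif m.group("else"):
            yield ("else", m.group(0))
        else:
            yield ("if", m.group(0))
        pos = m.end()
    if pos < len(text):
        # Trailing content after the last keyword.
        yield ("content", text[pos:])
```

Joining the values of all tokens gives back the original string, so no character is skipped.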
Usage:
In the host program, define a function called ParseCore().
This function needs to be passed (or know) the current core string.
If it were C++, it would be passed begin and end string iterators.
Either way, the string has to be local to the function.
In this function, sit in a while loop parsing the string.
On each match, do an if/else to see which group(s) from above matched.
It can only be one of these combinations:
Content
or
Else
or
Begin, If_Content, Core, End
or
Error
Only one group matters for recursion: the Core group.
When this group matches, you make a recursive call to ParseCore(), passing the Core string to it.
This repeats until there are no more matches.
Error reporting, building a tree structure, and anything else can be done
within this function.
You could even set a global flag at any point to unwind the recursive calls
and exit, for example if you want to STOP on error.
NOTE: On the initial call to ParseCore(), you just pass in the entire original string to kick off the parse.
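The recursion described above can be sketched in Python. This is only an approximation under stated assumptions: the built-in re module cannot run the recursive regex, so a depth counter over keyword matches stands in for the balanced-text match, and the "only one else per core" guard is not reproduced. KEYWORD, kind_of, find_block_end and parse_core are hypothetical names for this sketch.

```python
import re

# Assumed keyword pattern: {% if ... %}, {% else %} or {% end if %}.
KEYWORD = re.compile(
    r"\{%\s*(?:(?P<end>end\s+if)|(?P<else>else)|if\s+(?P<cond>(?:(?!%\}).)+?))\s*%\}",
    re.S,
)

def kind_of(m):
    """Classify a keyword match as 'end', 'else' or 'if'."""
    return "end" if m.group("end") else "else" if m.group("else") else "if"

def find_block_end(text, start):
    """Return the 'end if' match that balances an 'if' tag ending at start."""
    depth = 1
    for m in KEYWORD.finditer(text, start):
        kind = kind_of(m)
        if kind == "if":
            depth += 1
        elif kind == "end":
            depth -= 1
            if depth == 0:
                return m
    return None  # no balancing 'end if' was found

def parse_core(core, errors, level=0):
    """Return a list of nodes for core; append error messages to errors."""
    nodes, pos = [], 0
    while pos < len(core):
        m = KEYWORD.search(core, pos)
        if m is None:
            nodes.append(("content", core[pos:]))
            break
        if m.start() > pos:
            nodes.append(("content", core[pos:m.start()]))
        kind = kind_of(m)
        pos = m.end()
        if kind == "else":
            if level == 0:
                errors.append("'else' not in block: " + m.group(0))
            nodes.append(("else", m.group(0)))
        elif kind == "end":
            errors.append("unbalanced " + m.group(0))
        else:
            end = find_block_end(core, m.end())
            if end is None:
                # Unbalanced 'if': report it and keep parsing past the tag.
                errors.append("unbalanced " + m.group(0))
            else:
                # Recurse into the core between the 'if' and its 'end if'.
                inner = core[m.end():end.start()]
                children = parse_core(inner, errors, level + 1)
                nodes.append(("if", m.group("cond"), children))
                pos = end.end()
    return nodes
```

Called as parse_core(text, errors) with an empty errors list, the nested example from the question should produce one outer if node whose children include the inner if node, with errors left empty.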
Good luck!
# (?s)(?:(?<Content>(?&_content))|(?<Else>(?&_else))|(?<Begin>{%\s*if\s+(?<If_Content>(?&_ifbody))\s*%})(?<Core>(?&_core)|)(?<End>{%\s*end\s+if\s*%})|(?<Error>(?&_keyword)))(?(DEFINE)(?<_ifbody>(?>(?!%}).)+)(?<_core>(?>(?<_content>(?>(?!(?&_keyword)).)+)|(?(<_else>)(?!))(?<_else>(?>{%\s*else\s*%}))|(?>{%\s*if\s+(?&_ifbody)\s*%})(?:(?=.)(?&_core)|){%\s*end\s+if\s*%})+)(?<_keyword>(?>{%\s*(?:if\s+(?&_ifbody)|end\s+if|else)\s*%})))
(?s) # Dot-all modifier
# =====================
# Outer Scope
# ---------------
(?:
(?<Content> # (1), Non-keyword CONTENT
(?&_content)
)
| # OR,
# --------------
(?<Else> # (2), ELSE
(?&_else)
)
| # OR
# --------------
(?<Begin> # (3), IF
{% \s* if \s+
(?<If_Content> # (4), if content
(?&_ifbody)
)
\s* %}
)
(?<Core> # (5), The CORE
(?&_core)
|
)
(?<End> # (6)
{% \s* end \s+ if \s* %} # END IF
)
| # OR
# --------------
(?<Error> # (7), Unbalanced IF or END IF
(?&_keyword)
)
)
# =====================
# Subroutines
# ---------------
(?(DEFINE)
# __ If Body ----------------------
(?<_ifbody> # (8)
(?>
(?! %} )
.
)+
)
# __ Core -------------------------
(?<_core> # (9)
(?>
#
# __ Content ( non-keywords )
(?<_content> # (10)
(?>
(?! (?&_keyword) )
.
)+
)
|
#
# __ Else
# Guard: Only 1 'else'
# allowed in this core !!
(?(<_else>)
(?!)
)
(?<_else> # (11)
(?> {% \s* else \s* %} )
)
|
#
# IF (block start)
(?>
{% \s* if \s+
(?&_ifbody)
\s* %}
)
# Recurse core
(?:
(?= . )
(?&_core)
|
)
# END IF (block end)
{% \s* end \s+ if \s* %}
)+
)
# __ Keyword ----------------------
(?<_keyword> # (12)
(?>
{% \s*
(?:
if \s+ (?&_ifbody)
| end \s+ if
| else
)
\s* %}
)
)
)
Sample input (removed)
Selected Output (removed)
Pseudo Code Usage Example
bool bStopOnError = false;
regex RxCore(".....");
bool ParseCore( string sCore, int nLevel )
{
// Locals
bool bFoundError = false;
bool bBeforeElse = true;
match _matcher;
// Note: search() is assumed to resume after the previous match on each pass
while ( search ( sCore, RxCore, _matcher ) )
{
// Content
if ( _matcher["Content"].matched == true )
// Print non-keyword content
print ( _matcher["Content"].str() );
// OR, Analyze content.
// If this 'content' has errors and you wish to return:
// if ( bStopOnError )
// bFoundError = true;
else
// Else
if ( _matcher["Else"].matched == true )
{
// Check if we are not in a recursion
if ( nLevel <= 0 )
{
// Report error, this 'else' is outside an 'if/end if' block
// ( note - will only occur when nLevel == 0 )
print ( "\n>> Error, 'else' not in block " + _matcher["Else"].str() + "\n" );
// If this 'else' error will stop the process.
if ( bStopOnError == true )
bFoundError = true;
}
else
{
// Here, we are inside a core recursion.
// That means there can only be 1 'else'.
// Print 'else'.
print ( _matcher["Else"].str() );
// Set the state of 'else'.
bBeforeElse = false;
}
}
else
// Error ( will only occur when nLevel == 0 )
if ( _matcher["Error"].matched == true )
{
// Report error
print ( "\n>> Error, unbalanced " + _matcher["Error"].str() + "\n" );
// If this unbalanced 'if/end if' error will stop the process.
if ( bStopOnError == true )
bFoundError = true;
}
else
// IF/END IF block
if ( _matcher["Begin"].matched == true )
{
// Analyze 'if content' for errors; set bFoundError if you wish to return.
string sIfContent = _matcher["If_Content"].str();
// if ( bStopOnError )
// bFoundError = true;
// else
// {
// Print 'begin' ( includes 'if content' )
print ( _matcher["Begin"].str() );
//////////////////////////////
// Recurse a new 'core'
bool bResult = ParseCore( _matcher["Core"].str(), nLevel+1 );
//////////////////////////////
// Check recursion result. See if we should unwind.
if ( bResult == false && bStopOnError == true )
bFoundError = true;
else
// Print 'end'
print ( _matcher["End"].str() );
// }
}
else
{
// Reserved placeholder, won't get here at this time.
}
// Error-Return Check
if ( bFoundError == true && bStopOnError == true )
return false;
}
// Finished this core!! Return true.
return true;
}
///////////////////////////////
// Main
string strInitial = "...";
bool bResult = ParseCore( strInitial, 0 );
if ( bResult == false )
print ( "Parse terminated abnormally, check messages!\n" );
Upvotes: 4