Reputation: 2610
I'm trying to write the correct regular expression in PHP to parse a string (written by some user) to build a request. It can be as complex as :
name = 'benjo' and (surname = 'benny' or surname = 'bennie') or age = 4
Later I'll parse the string to build mySQL queries. For now, I'm just trying to find the correct regular expression to parse this string into an array that could look like :
$result = array(
0 => name = 'benjo',
1 => and
2 => array(
0 => surname = 'benny',
1 => or,
2 => surname = 'bennie',
),
3 => age = 4
);
I've thought about using recursive functions, and my regular expression for now is :
"#\((([^()]+|(?R))*)\)|(ou[^()])|(et[^()])#",
which of course does not work.
I'll be glad if someone could help, I'm getting kinda stuck here ! :) Tks, Romain
LET'S CHANGE THE CHALLENGE ! :) OK, now let's make it a bit more simple. What would it take with a regular expression and adding the constraint that we stay on "level one" !! No nested parenthesis, just one level, but still as many AND/ORs... Would that change anything in favor or REGEXPs ? (I really would like to avoid writing my mini parser although that sounds really interesting...
Upvotes: 1
Views: 695
Reputation: 56809
Theoretical regular expression is not powerful enough to do parentheses matching. Theoretical regular expression can only take care of left recursion/right recursion rules. Middle recursion rules is cannot be expressed with regular expression (e.g. <exp> -> "(" <exp> ")"
).
Regex in programming languages, however, implements features which allow regex to exceed the power of regular grammar. For example, backreference in regex allows one to write a regex which matches a non context-free languages. However, even with backreference, it's still not possible to balance parentheses with regex.
As PCRE library supports recursive regex via subroutine call feature, it is technically possible to parse such an expression with regex. However, unless you can write the regex yourself, which means that you understand what you are doing and can modify the regex to suit your needs, you should just write your own parser. Otherwise, you will end up with an unmaintainable mess.
(?(DEFINE)
(?<string>'[^']++')
(?<int>\b\d+\b)
(?<sp>\s*)
(?<key>\b\w+\b)
(?<value>(?&string)|(?&int))
(?<exp>(?&key) (?&sp) = (?&sp) (?&value))
(?<logic>\b (?:and|or) \b)
(?<main>
(?<token> \( (?&sp) (?&main) (?&sp) \) | (?&exp) )
(?:
(?&sp) (?&logic) (?&sp)
(?&token)
)*
)
)
(?:
^ (?&sp) (?= (?&main) (?&sp) $ )
|
(?!^) \G
(?&sp) (?&logic) (?&sp)
)
(?:
\( (?&sp) (?<m_main>(?&main)) (?&sp) \)
|
(?<m_key>(?&key)) (?&sp) = (?&sp) (?<m_value>(?&value))
)
The regex above should be use with preg_match_all
, and placed between delimiter with x flag (free spacing mode): /.../x
.
For each match:
m_main
capturing group has content, put the content through another round of matching.m_key
and m_value
capturing group.The (?(DEFINE)...)
block allows you to define named capturing groups for use in subroutine calls separately from the main pattern.
(?(DEFINE)
(?<string>'[^']++') # String literal
(?<int>\b\d+\b) # Integer
(?<sp>\s*) # Whitespaces between tokens
(?<key>\b\w+\b) # Field name
(?<value>(?&string)|(?&int)) # Field value
(?<exp>(?&key) (?&sp) = (?&sp) (?&value)) # Simple expression
(?<logic>\b (?:and|or) \b) # Logical operators
(?<main> # <token> ( <logic> <token> )*
# A token can contain a simple expression, or open a parentheses (...)
# When we open a parentheses, we recurse into the main pattern again
(?<token> \( (?&sp) (?&main) (?&sp) \) | (?&exp) )
(?:
(?&sp) (?&logic) (?&sp)
(?&token)
)*
)
)
The rest of the pattern is based on this technique to match all <token>
s in <token> ( <logic> <token> )*
with global matching operation.
The last part of the regex, while can be written as (?&token)
, is expanded to match the field name and value in the simple expressions.
(?:
\( (?&sp) (?<m_main>(?&main)) (?&sp) \)
|
(?<m_key>(?&key)) (?&sp) = (?&sp) (?<m_value>(?&value))
)
Upvotes: 4