Writing a PHP query parser with regular expressions

Question

I'm trying to write the correct regular expression in PHP to parse a string (written by some user) to build a request. It can be as complex as :

name = 'benjo' and (surname = 'benny' or surname = 'bennie') or age = 4

Later I'll parse the string to build mySQL queries. For now, I'm just trying to find the correct regular expression to parse this string into an array that could look like :

$result = array(
0 => name = 'benjo',
1 => and
2 => array(
    0 => surname = 'benny',
    1 => or,
    2 => surname = 'bennie',
    ),
3 => age = 4
);

I've thought about using recursive functions, and my regular expression for now is :

"#$(([^()]+|(?R))*)$|(ou[^()])|(et[^()])#",

which of course does not work.

I'll be glad if someone could help, I'm getting kinda stuck here ! :) Tks, Romain

LET'S CHANGE THE CHALLENGE ! :) OK, now let's make it a bit more simple. What would it take with a regular expression and adding the constraint that we stay on "level one" !! No nested parenthesis, just one level, but still as many AND/ORs... Would that change anything in favor or REGEXPs ? (I really would like to avoid writing my mini parser although that sounds really interesting...

nhahtdh · Accepted Answer

Theoretical regular expression is not powerful enough to do parentheses matching. Theoretical regular expression can only take care of left recursion/right recursion rules. Middle recursion rules is cannot be expressed with regular expression (e.g. -> "(" ")").

Regex in programming languages, however, implements features which allow regex to exceed the power of regular grammar. For example, backreference in regex allows one to write a regex which matches a non context-free languages. However, even with backreference, it's still not possible to balance parentheses with regex.

As PCRE library supports recursive regex via subroutine call feature, it is technically possible to parse such an expression with regex. However, unless you can write the regex yourself, which means that you understand what you are doing and can modify the regex to suit your needs, you should just write your own parser. Otherwise, you will end up with an unmaintainable mess.

(?(DEFINE)
  (?'[^']++')
  (?\b\d+\b)
  (?\s*)
  (?\b\w+\b)
  (?(?&string)|(?&int))
  (?(?&key) (?&sp) = (?&sp) (?&value))
  (?\b (?:and|or) \b)
  (?
    (? $ (?&sp) (?&main) (?&sp) $ | (?&exp) )
    (?:
      (?&sp) (?&logic) (?&sp)
      (?&token) 
    )*
  )
)
(?:
  ^ (?&sp) (?= (?&main) (?&sp) $ )
  |
  (?!^) \G
  (?&sp) (?&logic) (?&sp)
)
(?:
  $ (?&sp) (?(?&main)) (?&sp) $
  |
  (?(?&key)) (?&sp) = (?&sp) (?(?&value))
)

Demo on regex101

The regex above should be use with preg_match_all, and placed between delimiter with x flag (free spacing mode): /.../x.

For each match:

If m_main capturing group has content, put the content through another round of matching.
Otherwise, get the key and value in m_key and m_value capturing group.

Explanation

The (?(DEFINE)...) block allows you to define named capturing groups for use in subroutine calls separately from the main pattern.

(?(DEFINE)
  (?'[^']++')  # String literal
  (?\b\d+\b)      # Integer
  (?\s*)           # Whitespaces between tokens
  (?\b\w+\b)      # Field name
  (?(?&string)|(?&int)) # Field value
  (?(?&key) (?&sp) = (?&sp) (?&value)) # Simple expression
  (?\b (?:and|or) \b) # Logical operators
  (?             #  (   )*
    # A token can contain a simple expression, or open a parentheses (...)
    # When we open a parentheses, we recurse into the main pattern again
    (? $ (?&sp) (?&main) (?&sp) $ | (?&exp) )
    (?:
      (?&sp) (?&logic) (?&sp)
      (?&token) 
    )*
  )
)

The rest of the pattern is based on this technique to match all s in ( )* with global matching operation.

The last part of the regex, while can be written as (?&token), is expanded to match the field name and value in the simple expressions.

(?:
  $ (?&sp) (?(?&main)) (?&sp) $
  |
  (?(?&key)) (?&sp) = (?&sp) (?(?&value))
)

Writing a PHP query parser with regular expressions

Answers (1)

Explanation

Related Questions