Reputation: 278
I'm writing a grammar to parse HTSQL syntax and am stuck with how to deal with the reuse of the /
character for both the segment and division operators. The described grammar isn't terribly formal, so I've been following the exact output of the Python implementation, which from a cursory glance seems to be a handwritten parser, rather than using a parser generator - for reference the parser generator I'm currently using is CL-YACC
with CL-LEX
. (FWIW the full thing is here, although likely a bit outdated.)
The one ambiguity I'm struggling with arises due to "/1"
being parsed as '(:COLLECT (:INTEGER "1"))'
, but "/1/2"
being parsed as '(:COLLECT (:OPERATOR / (:INTEGER "1") (:INTEGER "2")))'
, i.e. one is the segment separator, the other one is division; "/1//2"
again is '(:COLLECT (:OPERATOR / (:INTEGER "1") (:COLLECT (:INTEGER "2"))))'
.
The question is thus, how can I deal with this in the grammar specification without resorting to switching to a manual parser? Would a switch to a different parser generator class (instead of LALR(1)) help?
So far I've tried different variations of the grammar, however the fact that the precedences are fixed for the whole grammar also interferes with both interpretations of the slash. The other way I tried was to disambiguate in the lexer, i.e. treating the first slash (in each "group") differently and returning a different symbol, e.g. DIV
- however I couldn't find a good rule and somehow doubt that it exists purely by looking at the lexical structure.
Lastly, I'm tempted to resolve this by diverging from the given parser entirely to make my life easier. Would you consider that more desirable, in the sense that having a predictable grammar is more easily understood?
Exhibit 1, the Python script to examine the parse tree:
import htsql
application = htsql.HTSQL("sqlite:///htsql_demo.sqlite")
global y
y = None
def p(string):
global y
with application:
y = htsql.core.syn.parse.parse(string)
return y
def l(name):
result = []
for c in name:
if c.isupper() and result:
result.append("-")
result.append(c)
return "".join(result)
def keyword(name):
return ":{}".format(name.upper())
def n(expression):
name = expression.__class__.__name__
name = name[:name.find("Syntax")]
return keyword(l(name))
def t(expression):
arguments = [n(expression)]
d = expression.__dict__
if "identifier" in d:
arguments.append(t(expression.identifier))
if "text" in d:
arguments.append("\"{}\"".format(expression.text))
if "symbol" in d:
if not isinstance(expression, (htsql.core.syn.syntax.ProjectSyntax, htsql.core.syn.syntax.FilterSyntax, htsql.core.syn.syntax.CollectSyntax, htsql.core.syn.syntax.DetachSyntax)):
arguments.append(expression.symbol)
if "arm" in d:
arguments.append(t(expression.arm))
if "larm" in d:
arguments.append(t(expression.larm))
if "rarm" in d:
arguments.append(t(expression.rarm))
if "arms" in d:
arguments.extend(t(x) for x in expression.arms)
if "rarms" in d:
arguments.extend(t(x) for x in expression.rarms)
return "({})".format(" ".join(arguments))
# t(p("/school"))
# '(:COLLECT (:IDENTIFIER "school"))
# t(p("/'school'"))
# '(:COLLECT (:STRING "school"))
Exhibit 2, my current parser, which doesn't deal with this correctly:
(defpackage #:cl-htsql
(:use #:cl #:alexandria #:cl-lex #:yacc)
(:import-from #:arnesi #:with-collector))
(eval-when (:compile-toplevel :load-toplevel :execute)
(defun maybe-intern (name &optional (package NIL package-p))
"If NAME is a SYMBOL, return it, otherwise INTERN it."
(cond
((symbolp name)
name)
(package-p
(intern name package))
(T
(intern name))))
(defmacro define-lexer (name &body patterns)
"Shortcut for DEFINE-STRING-LEXER."
`(define-string-lexer ,name
,@(mapcar
(lambda (pattern)
(etypecase pattern
((or symbol string)
(let ((symbol (maybe-intern pattern))
(pattern (string pattern)))
`(,pattern (return (values ',symbol ',symbol)))))
(list
(destructuring-bind (pattern &optional symbol value) pattern
(let* ((symbol (or symbol (intern pattern)))
(value (or value symbol)))
(etypecase symbol
(list
`(,pattern ,symbol))
(symbol
`(,pattern (return (values ',symbol ',value))))))))))
patterns))))
;; parser are results are to be treated immutable
(define-lexer string-lexer
/
("\\|" \|)
("\\&" &)
<=
>=
==
=
!==
!=
!~
!
~
<
>
@
("\\?" ?)
("\\." \.)
("\\(" \()
("\\)" \))
("\\+" +)
-
("\\*" *)
\:
("-?0|[1-9][0-9]*(\\.[0-9]*)?([eE][+-]?[0-9]+)?"
(return (cond
((find #\e $@)
(values 'float $@))
((find #\. $@)
(values 'decimal $@))
(T
(values 'integer $@)))))
("([^\"\\.\\?~\'=<>\\(\\)@\\|\\&/:])+" (return (values 'name $@)))
("\'([^\\\']|\\.)*?\'" (return (values 'string (string-trim "\'" $@))))
("\"([^\\\"]|\\.)*?\"" (return (values 'string (string-trim "\"" $@)))))
(define-parser *expression-parser*
(:muffle-conflicts (44 0))
(:start-symbol query)
(:terminals (|\|| #+(or)div & ! |.| ? / = != !== !~ ~ < > == <= >= \( \) + - * @ name integer decimal float string))
(:precedence ((:left @) (:left ~) (:left |.|) (:left + -) (:left * div) (:left = != == !== ~ !~ < <= > >=) (:left !) (:left &) (:left |\||) (:left ?) (:left /)))
(query
segment)
(segment
(/ segment (lambda (x y) (declare (ignore x)) (if (eq y :skip) '(:skip) `(:collect ,y))))
skip
group)
(skip
((constantly :skip)))
(group
(\( segment \) (lambda (x y z) (declare (ignore x z)) `(:group ,y)))
sieve)
(sieve
(segment ? segment (lambda (x y z) (declare (ignore y)) `(:filter ,x ,z)))
or)
(or
(segment |\|| segment (lambda (x y z) `(:operator ,y ,x ,z)))
and)
(and
(segment & segment (lambda (x y z) `(:operator ,y ,x ,z)))
not)
(not
(! segment (lambda (x y) `(:prefix ,x ,y)))
comparison)
(comparison
(segment = segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment != segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment ~ segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment !~ segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment == segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment !== segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment < segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment <= segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment > segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment >= segment (lambda (x y z) `(:operator ,y ,x ,z)))
addition)
(addition
(segment + segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment - segment (lambda (x y z) `(:operator ,y ,x ,z)))
multiplication)
(multiplication
(segment * segment (lambda (x y z) `(:operator ,y ,x ,z)))
(segment / segment (lambda (x y z) (declare (ignore y)) `(:operator / ,x ,z)))
composition)
(composition
(segment |.| segment (lambda (x y z) (declare (ignore y)) `(:compose ,x ,z)))
attach)
(attach
(segment @ segment (lambda (x y z) (declare (ignore y)) `(:attach ,x ,z)))
detach)
(detach
(@ segment (lambda (x y) (declare (ignore x)) `(:detach ,y)))
term)
(term
(name (lambda (x) `(:identifier ,x)))
(string (lambda (x) `(:string ,x)))
(number (lambda (x) `(:integer ,x)))
(integer (lambda (x) `(:integer ,x)))
(decimal (lambda (x) `(:decimal ,x)))
(float (lambda (x) `(:float ,x)))))
(defun make-lexer-for-source (source)
"Make a lexer for the SOURCE, either a STRING or a STREAM."
(etypecase source
(string (string-lexer source))
(stream
(flet ((ignore (c)
(declare (ignore c))))
(stream-lexer #'read-line #'string-lexer #'ignore #'ignore)))))
(defun lex-source (source)
"Debug helper to lex a SOURCE into a list of tokens."
(let ((lexer (make-lexer-for-source source)))
(loop
for (x y) = (multiple-value-list (funcall lexer))
while x
collect (list x y))))
(define-condition htsql-parse-error (simple-error) ())
(defun translate-yacc-error (error)
(make-condition
'htsql-parse-error
:format-control "Couldn't parse HTSQL query: ~A."
:format-arguments (list error)))
(defun parse-htsql-query (source)
"Parse SOURCE into a syntax tree. The SOURCE may be either a STRING or
a STREAM."
(handler-case
(parse-with-lexer
(make-lexer-for-source source)
*expression-parser*)
(yacc-parse-error (error)
(error (translate-yacc-error error)))))
;; > (parse-htsql-query "/1/")
;; (:OPERATOR / (:COLLECT (:INTEGER "1")) :SKIP)
;; > (parse-htsql-query "/1/2")
;; (:OPERATOR / (:COLLECT (:INTEGER "1")) (:INTEGER "2"))
Upvotes: 4
Views: 133
Reputation: 241731
If you look at the operator list, you will see that there is another case where the same symbol is used as a binary and a unary operator, with different precedences: unary minus. This is pretty normal in expression languages, and yacc and most yacc derivatives offer a solution: explicit precedence declarations for productions. In yacc, you might write
%token UNARY_MINUS UNARY_SLASH
%left UNARY_SLASH
%left '-' '+'
%left '/' '*'
%left UNARY_MINUS
%%
expr: '/' expr %prec UNARY_SLASH
| expr '*' expr
| expr '/' expr
| expr '+' expr
| expr '-' expr
| '-' expr %prec UNARY_MINUS
I don't know if CL-YACC offers an equivalent; I didn't find anything in the documentation.
You are under no obligation to use precedence declarations, and I'm not even convinced that they are a good idea. It's only slightly more complicated to just say what you mean in an unambiguous way:
term: ID | NUMBER | ... | '(' expr0 ')'
expr0: '/' expr0
| expr1
expr1: expr1 '+' expr2
| expr1 '-' expr2
| expr2
expr2: expr2 '/' expr3
| expr2 '*' expr3
| expr3
expr3: '-' expr3
| term
The above is unambiguous and requires no precedence declarations at all.
The (informal) HTSQL grammar says that the syntax of the "segment" operator is / T
rather than / expr
, but I don't see any indication what a T
might be. To my mind, T
is a term or perhaps an lexpr (T := expr
), which makes 1 / 2
an unlikely candidate. But, as you say, it is informal.
Upvotes: 3