Reputation: 1720
I'm writing a parser for my own language. I'm trying to parse the phrase
Number a is 10;
which is basically equivalent to int a = 10;
It should match the variable_def rule. When I run it, I get these errors:
line 1:0 extraneous input 'Number' expecting {<EOF>, 'while', ';', 'if', 'function', TYPE, 'global', 'room', ID}
line 1:9 mismatched input 'is' expecting '('
This is my grammar:
grammar Script;
@header {
package script;
}
// PARSER
program
:
block EOF
;
block
:
(
statement
| functionDecl
)*
;
statement
:
(variable_def
| functionCall
| ifStatement
| forStatement
| whileStatement) ';'
;
whileStatement
:
'while' '(' expression ')' '{' (statement)* '}'
;
forStatement
:
;
ifStatement
:
'if' '(' expression ')' '{' statement* '}'
(
(
'else' '{' statement* '}'
)
|
(
'else' ifStatement
)
)?
;
functionDecl
:
'function' ID
(
'('
(
TYPE ID
)?
(
',' TYPE ID
)* ')'
)?
(
'returns' RETURN_TYPE
)? '{' statement* '}'
;
functionCall
:
ID '(' exprList? ')'
;
exprList
:
expression
(
',' expression
)*
;
variable_def
:
TYPE assignment
| GLOBAL variable_def
| ROOM variable_def
;
expression
:
'-' expression # unaryMinusExpression
| '!' expression # notExpression
| expression '^' expression # powerExpression
| expression '*' expression # multiplyExpression
| expression '/' expression # divideExpression
| expression '%' expression # modulusExpression
| expression '+' expression # addExpression
| expression '-' expression # subtractExpression
| expression '>=' expression # gtEqExpression
| expression '<=' expression # ltEqExpression
| expression '>' expression # gtExpression
| expression '<' expression # ltExpression
| expression '==' expression # eqExpression
| expression '!=' expression # notEqExpression
| expression '&&' expression # andExpression
| expression '||' expression # orExpression
| expression IN expression # inExpression
| NUMBER # numberExpression
| BOOLEAN # boolExpression
| functionCall # functionCallExpression
| '(' expression ')' # expressionExpression
;
assignment
:
ID ASSIGN expression
;
// LEXER
RETURN_TYPE
:
TYPE
| 'Nothing'
;
TYPE
:
'Number'
| 'String'
| 'Anything'
| 'Boolean'
| 'Growable'? 'List' 'of' TYPE
;
GLOBAL
:
'global'
;
ROOM
:
'room'
;
ASSIGN
:
'is'
(
'a'
| 'an'
| 'the'
)?
;
EQUAL
:
'is'?
(
'equal'
(
's'
| 'to'
)?
| 'equivalent' 'to'?
| 'the'? 'same' 'as'?
)
;
IN
:
'in'
;
BOOLEAN
:
'true'
| 'false'
;
NUMBER
:
'-'? INT '.' INT EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? '.' INT EXP? // -.35, .35e5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment
EXP
:
[Ee] [+\-]? INT
;
fragment
INT
:
'0'
| [1-9] [0-9]*
;
STRING
:
'"'
(
' ' .. '~'
)* '"'
;
ID
:
(
'a' .. 'z'
| 'A' .. 'Z'
| '_'
)
(
'a' .. 'z'
| 'A' .. 'Z'
| '0' .. '9'
| '_'
)*
;
fragment
JAVADOC_COMMENT
:
'/*' .*? '*/'
;
fragment
LINE_COMMENT
:
(
'//'
| '#'
) ~( '\r' | '\n' )*
;
COMMENT
:
(
LINE_COMMENT
| JAVADOC_COMMENT
) -> skip
;
WS
:
[ \t\n\r]+ -> skip
;
How can I fix this error?
Upvotes: 2
Views: 2785
Reputation: 170148
The main reason is that, in your current grammar, the TYPE token will never be created: RETURN_TYPE also matches everything TYPE matches and is defined before TYPE, so it takes precedence whenever both rules match the same text.
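The tie-breaking can be imitated outside ANTLR. The following is only a sketch in Python, not ANTLR itself: the helper first_token is hypothetical, the regular expressions are simplified stand-ins for the lexer rules (the "List of" alternative is left out), and the selection logic assumes the usual rule that the longest match wins and, for equally long matches, the rule listed first wins.

```python
import re

def first_token(rules, text):
    """Return (rule_name, lexeme) for the first token of text.

    Imitates the lexer's tie-breaking: the longest match wins, and among
    equally long matches the rule listed first wins.
    """
    best_name, best_lexeme = None, ""
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when the lengths are equal.
        if m and len(m.group()) > len(best_lexeme):
            best_name, best_lexeme = name, m.group()
    return best_name, best_lexeme

TYPE = r"Number|String|Anything|Boolean"

# Rule order as in the question: RETURN_TYPE precedes TYPE.
original = [
    ("RETURN_TYPE", TYPE + r"|Nothing"),
    ("TYPE", TYPE),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
]

# The same rules with RETURN_TYPE taken out of the lexer.
without_return_type = [rule for rule in original if rule[0] != "RETURN_TYPE"]

print(first_token(original, "Number a is 10;"))             # ('RETURN_TYPE', 'Number')
print(first_token(without_return_type, "Number a is 10;"))  # ('TYPE', 'Number')
```

With the original order the parser receives a RETURN_TYPE token where variable_def expects TYPE, which is consistent with the "extraneous input 'Number'" message.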
Also, you're doing too much in the lexer. As soon as you start gluing words together in the lexer, it's a sign you should be making those rules parser rules instead.
The lexer does skip white space, but only between tokens: a space is not ignored inside a lexer rule. Take your ASSIGN rule for example:
ASSIGN
: 'is' ( 'a' | 'an' | 'the' )?
;
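A lexer rule concatenates its literals character by character. As a rough analogy only (Python's re module standing in for the ANTLR lexer, which is an assumption of this sketch), the ASSIGN rule corresponds to the regular expression is(a|an|the)? and can be probed like this:

```python
import re

# Regex analogue of the lexer rule  ASSIGN : 'is' ( 'a' | 'an' | 'the' )? ;
ASSIGN = re.compile(r"is(a|an|the)?")

for text in ["is", "isa", "isan", "isthe", "is a", "is the"]:
    # fullmatch succeeds only if the whole string is one ASSIGN lexeme
    print(repr(text), bool(ASSIGN.fullmatch(text)))
# 'is' True
# 'isa' True
# 'isan' True
# 'isthe' True
# 'is a' False
# 'is the' False
```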
This rule will not match the string "is a" (with a space between "is" and "a"). Besides a bare "is", it will only match "isa", "isan" and "isthe". The solution: create a parser rule from it:
assign
: 'is' ( 'a' | 'an' | 'the' )?
;
which is equivalent to:
assign
: 'is' ( 'a' | 'an' | 'the' )?
;
IS : 'is';
A : 'a';
AN : 'an';
THE : 'the';
...
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
This will cause the words 'is', 'a', 'an' and 'the' to never be matched as an ID token. So the following source will fail as a proper assignment:
Number a is 42;
because the 'a' is tokenized as an A token, not an ID.
To work around this, you could add the following parser rule:
id
: ( ID | A | AN | IS | THE | ... )
;
and use that rule in other parser rules instead of ID.
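The keyword problem and the workaround can be illustrated with a small stand-alone tokenizer. This is a sketch in Python rather than ANTLR: tokenize, ID_LIKE and the token name SEMI are hypothetical names introduced here, NUMBER is simplified to plain digits, and the matching logic assumes longest match first, then rule order.

```python
import re

# Token rules in grammar order: keywords first, ID last.
RULES = [
    ("TYPE", r"Number|String|Anything|Boolean"),
    ("IS", r"is"),
    ("A", r"a"),
    ("AN", r"an"),
    ("THE", r"the"),
    ("NUMBER", r"[0-9]+"),   # simplified: digits only
    ("SEMI", r";"),          # hypothetical name for the ';' literal
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
]

# Counterpart of the parser rule `id`: token types accepted as an identifier.
ID_LIKE = {"ID", "A", "AN", "IS", "THE"}

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # white space is skipped between tokens
            pos += 1
            continue
        best_name, best_lexeme = None, ""
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and len(m.group()) > len(best_lexeme):
                best_name, best_lexeme = name, m.group()
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best_name, best_lexeme))
        pos += len(best_lexeme)
    return tokens

tokens = tokenize("Number a is 42;")
print(tokens)
# [('TYPE', 'Number'), ('A', 'a'), ('IS', 'is'), ('NUMBER', '42'), ('SEMI', ';')]
print(tokens[1][0] == "ID")     # False: 'a' is an A token
print(tokens[1][0] in ID_LIKE)  # True: the id rule still accepts it
```

A longer name such as abc is unaffected, because ID then produces the longest match.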
A quick demo would look like this:
grammar Script;
// PARSER
program
: block EOF
;
block
: ( statement | functionDecl )*
;
statement
: ( variable_def
| functionCall
| ifStatement
| forStatement
| whileStatement
)
';'
;
whileStatement
: 'while' '(' expression ')' '{' statement* '}'
;
forStatement
:
;
ifStatement
: 'if' '(' expression ')' '{' statement* '}'
( ( 'else' '{' statement* '}' ) | ( 'else' ifStatement ) )?
;
functionDecl
: 'function' id ( '(' ( type id )? ( ',' type id )* ')' )?
( 'returns' return_type )? '{' statement* '}'
;
functionCall
: id '(' exprList? ')'
;
exprList
: expression ( ',' expression )*
;
variable_def
: type assignment
| GLOBAL variable_def
| ROOM variable_def
;
expression
: '-' expression # unaryMinusExpression
| '!' expression # notExpression
| expression '^' expression # powerExpression
| expression '*' expression # multiplyExpression
| expression '/' expression # divideExpression
| expression '%' expression # modulusExpression
| expression '+' expression # addExpression
| expression '-' expression # subtractExpression
| expression '>=' expression # gtEqExpression
| expression '<=' expression # ltEqExpression
| expression '>' expression # gtExpression
| expression '<' expression # ltExpression
| expression '==' expression # eqExpression
| expression '!=' expression # notEqExpression
| expression '&&' expression # andExpression
| expression '||' expression # orExpression
| expression IN expression # inExpression
| NUMBER # numberExpression
| BOOLEAN # boolExpression
| functionCall # functionCallExpression
| '(' expression ')' # expressionExpression
;
assignment
: id assign expression
;
return_type
: type
| 'Nothing'
;
type
: TYPE
| 'Growable'? 'List' OF type
;
assign
: 'is' ( A | AN | THE )?
;
equal
: 'is'? ( EQUAL ( S
| TO
)?
| EQUIVALENT TO?
| THE? SAME AS?
)
;
id
: ( ID | OF | A | AN | EQUAL | S | EQUIVALENT | TO | THE | SAME | AS )
;
// LEXER
// Some keywords you might want to match as identifiers too:
OF : 'of';
A : 'a';
AN : 'an';
EQUAL : 'equal';
S : 's';
EQUIVALENT : 'equivalent';
TO : 'to';
THE : 'the';
SAME : 'same';
AS : 'as';
COMMENT
: ( LINE_COMMENT | JAVADOC_COMMENT ) -> skip
;
WS
: [ \t\n\r]+ -> skip
;
TYPE
: 'Number'
| 'String'
| 'Anything'
| 'Boolean'
;
GLOBAL
: 'global'
;
ROOM
: 'room'
;
IN
: 'in'
;
BOOLEAN
: 'true'
| 'false'
;
NUMBER
: '-'? INT '.' INT EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? '.' INT EXP? // -.35, .35e5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
STRING
: '"' .*? '"'
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
fragment EXP
: [Ee] [+\-]? INT
;
fragment INT
: '0'
| [1-9] [0-9]*
;
fragment JAVADOC_COMMENT
: '/*' .*? '*/'
;
fragment LINE_COMMENT
: ( '//' | '#' ) ~( '\r' | '\n' )*
;
Upvotes: 1
Reputation: 17455
The particular error occurs because, in the lexer part of the grammar, the TYPE rule clashes with the RETURN_TYPE rule. There are other mistakes as well, but the problem can be stripped down to just the following:
grammar Script;
program
:
block EOF
;
block
:
(
statement
| functionDecl
)*
;
statement
:
(
variable_def
) ';'
;
functionDecl
:
'function' ID
(
'returns' RETURN_TYPE
)?
'{' statement* '}'
;
variable_def
:
TYPE assignment
;
expression
:
NUMBER # numberExpression
;
assignment
:
ID ASSIGN expression
;
RETURN_TYPE
:
TYPE
| 'Nothing'
;
TYPE
:
'Number'
;
ASSIGN
:
'is'
(
'a'
| 'an'
| 'the'
)?
;
NUMBER
:
'-'? INT // -3, 45
;
fragment
INT
:
'0'
| [1-9] [0-9]*
;
ID
:
(
'a' .. 'z'
| 'A' .. 'Z'
| '_'
)
(
'a' .. 'z'
| 'A' .. 'Z'
| '0' .. '9'
| '_'
)*
;
WS
:
[ \t\n\r]+ -> skip
;
If RETURN_TYPE is converted into a parser rule, e.g. returnType, then everything works (for this particular test; as I said, your grammar contains other mistakes like this one). This demonstrates a basic principle of ANTLR (and of other parser generators that separate the lexer from the parser): the lexer always works in its own context, so it cannot decide whether a particular sequence of characters is one token or another when two rules match that same sequence. So you have two options: introduce lexer contexts (called modes), or keep only basic and unambiguous entities at the lexer level and move everything else to the parser.
Upvotes: 1