Reputation: 437
I'm trying to use ANTLR4 to process the following from a file:
process example(test){
run $test say hi
}
My grammar looks like the following:
grammar example;
main: process* EOF;
processCall: processName '(' processArg ')';
process: ('process' | 'Process' | 'PROCESS') processName '(' processArg ') {' IDENTIFIER?
processArgReplaces IDENTIFIER? '}';
processArgReplaces: IDENTIFIER? '$' processArg IDENTIFIER?;
processName: WORD;
processArg: (WORD ',')* WORD;
WORD: [a-zA-Z0-9?_]+;
IDENTIFIER: [a-zA-Z] [ a-zA-Z0-9?_]+;
BS: [\r\n\t\f]+ -> skip;
But my output gives me: no viable alternative at input 'process example name('
The problem is that I need to support spaces in certain areas:
process name(arg){
[anything here is one token]
OR
anotherprocess(arg) [comes out as {anotherprocess} and {arg}]
}
I've tried changing the IDENTIFIER rule, since I think it takes over the match before 'process' has a chance to. But shouldn't the explicit 'process' token mean that line isn't matched as just generic words?
Upvotes: 2
Views: 176
Reputation: 53337
In cases like this it is always extremely helpful to print the list of tokens the lexer recognized. In your case you will get:
[@0,0:14='process example',<11>,1:0]
[@1,15:15='(',<1>,1:15]
[@2,16:19='test',<10>,1:16]
[@3,20:20=')',<2>,1:20]
[@4,27:30='run ',<11>,2:4]
[@5,31:31='$',<8>,2:8]
[@6,32:42='test say hi',<11>,2:9]
[@7,44:44='}',<7>,3:0]
[@8,46:45='<EOF>',<-1>,4:0]
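If you want to experiment with the lexer rules without regenerating the parser each time, the behaviour can be approximated in a few lines of Python. This is only a rough sketch, not ANTLR itself: it assumes the usual lexer behaviour (the longest match wins, ties go to the rule defined first, unmatched characters are dropped), it folds the three 'process' literals into one rule, and the names RULES, SAMPLE and tokenize are made up for this illustration. The sample input is assumed to use four spaces of indentation, as the column numbers in the listing suggest.

```python
import re

# Lexer rules of the question's grammar, implicit literals first.
# Assumption: this order approximates how ANTLR numbers the tokens.
RULES = [
    ("'('", r"\("),
    ("')'", r"\)"),
    ("'process'", r"process|Process|PROCESS"),
    ("') {'", r"\) \{"),
    ("'}'", r"\}"),
    ("'$'", r"\$"),
    ("','", r","),
    ("WORD", r"[a-zA-Z0-9?_]+"),
    ("IDENTIFIER", r"[a-zA-Z][ a-zA-Z0-9?_]+"),  # note the space in the set
    ("BS", r"[\r\n\t\f]+"),                      # skipped, contains no space
]

SAMPLE = "process example(test){\n    run $test say hi\n}\n"


def tokenize(text):
    """Return (rule name, matched text) pairs using longest-match lexing."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            pos += 1  # no rule matches: drop the character and go on
            continue
        if best_name != "BS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for name, matched in tokenize(SAMPLE):
    print(f"{name:12} {matched!r}")
```

The first pair printed is IDENTIFIER with the text 'process example', mirroring token @0 in the listing above.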
As you can see, the input 'process example' is recognized as a single token, while you expected 'process' to be recognized as a keyword. The reason for this misbehavior is the space in the IDENTIFIER rule: the lexer always prefers the longest match, so IDENTIFIER wins over the shorter 'process' literal. This is going to create a lot of problems. In our writing system the space character is a separator between words. You cannot treat it as a separator in some places and as part of a larger token in others. Instead, I recommend changing the grammar as follows (this also converts all implicit tokens to explicit ones, which avoids other trouble):
grammar Example;
start: process* EOF;
processCall: processName OPEN_PAR processArg CLOSE_PAR;
process:
PROCESS processName OPEN_PAR processArg CLOSE_PAR OPEN_CURLY IDENTIFIER? processArgReplaces IDENTIFIER? CLOSE_CURLY
;
processArgReplaces: IDENTIFIER? DOLLAR processArg IDENTIFIER?;
processName: IDENTIFIER;
processArg: (IDENTIFIER COMMA)* IDENTIFIER;
OPEN_PAR: '(';
CLOSE_PAR: ')';
OPEN_CURLY: '{';
CLOSE_CURLY: '}';
COMMA: ',';
DOLLAR: '$';
PROCESS: [pP] [rR] [oO] [cC] [eE] [sS] [sS];
IDENTIFIER: [a-zA-Z] [a-zA-Z0-9?_]+;
WS: [ \r\n\t\f]+ -> skip;
This gives you a nice parse tree for the sample input.
In your description you mention a part as [anything here is one token]. You probably want to skip all that, as you are not interested in it. However, I recommend that you still parse that part (and just leave it alone). Otherwise you would have to implement that double role of whitespace, and you may need the parsed content later anyway.
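For comparison, here is roughly what the one-token alternative involves. In ANTLR this kind of context switch is normally done with lexer modes in a separate lexer grammar. The Python below is only a sketch of the idea under that assumption, not generated ANTLR code. The names DEFAULT_RULES, BODY and tokenize_with_body are hypothetical, and the nested 'anotherprocess(arg)' case from the question is not handled.

```python
import re

# Default mode: the explicit tokens of the revised grammar, whitespace skipped.
DEFAULT_RULES = [
    ("PROCESS", r"[pP][rR][oO][cC][eE][sS][sS]"),
    ("OPEN_PAR", r"\("),
    ("CLOSE_PAR", r"\)"),
    ("OPEN_CURLY", r"\{"),
    ("COMMA", r","),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9?_]+"),
    ("WS", r"[ \r\n\t\f]+"),
]


def tokenize_with_body(text):
    """Lex the header normally, then take everything up to '}' as one BODY token."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in DEFAULT_RULES:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:  # longest match, first rule on ties
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
        if best_name == "OPEN_CURLY":
            # "Body mode": spaces are now part of the token, not separators.
            end = text.find("}", pos)
            if end == -1:
                raise ValueError("unterminated process body")
            tokens.append(("BODY", text[pos:end].strip()))
            tokens.append(("CLOSE_CURLY", "}"))
            pos = end + 1
    return tokens


sample = "process example(test){\n    run $test say hi\n}\n"
for name, matched in tokenize_with_body(sample):
    print(f"{name:12} {matched!r}")
```

Here the keyword and the process name come out as separate tokens, while 'run $test say hi' is kept as a single BODY token. The price is the extra mode-switching logic, and the body content is not available to the parser in structured form.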
Upvotes: 2