Reputation: 875
I am new to ANTLR4 and I am trying to create a grammar that parses fluentd config files into a tree. Can you point out what I am doing wrong here?
The fluentd syntax looks a lot like Apache's (pseudo-xml, shell-style comments, kv-pairs in a tag), for example:
# Receive events from 24224/tcp
<source>
@type forward
port 24224
</source>
# example
<match>
# file or memory
buffer_type file
<copy>
file /path
</copy>
</match>
This is my grammar so far:
grammar Fluentd;
// root element
content: (entry | comment)*;
entry: '<' name tag? '>' (entry | comment | param)* '<' '/' close_ '>';
name: NAME;
close_: NAME;
tag: TAG;
comment: '#' NL;
param: name value NL;
value: ANY;
ANY: .*?;
NL: ('\r'?'\n'|'\n') -> skip;
TAG: ('a'..'z' | 'A'..'Z' | '_' | '0'..'9'| '$' |'.' | '*' | '{' | '}')+;
NAME: ('a'..'z'| 'A..Z' | '@' | '_' | '0'..'9')+;
WS: (' '|'\t') -> skip;
...And it fails miserably on the above input:
line 2:2 mismatched input 'Receive' expecting NL
line 3:1 missing NAME at 'source'
line 4:8 mismatched input 'forward' expecting ANY
line 6:2 mismatched input 'source' expecting NAME
line 8:2 mismatched input 'example' expecting NL
line 9:1 missing NAME at 'match'
line 10:6 mismatched input 'file' expecting NL
line 12:2 mismatched input 'match' expecting NAME
Upvotes: 0
Views: 111
Reputation: 170158
The first thing you must realise is that the lexer works independently from the parser. The lexer simply creates tokens by trying to match as many characters as possible. If two or more lexer rules match the same number of characters, the rule defined first will "win".
Having said that, the input source (the tag name in your example) can therefore never be tokenised as a NAME, since the TAG rule also matches it and is defined before NAME.
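The two tie-breaking rules can be seen in isolation in a minimal sketch. The lexer grammar Demo and its rules KEYWORD and WORD are hypothetical names made up for this illustration; they are not part of the fluentd grammar.

```antlr
lexer grammar Demo;   // hypothetical grammar, for illustration only

// Both rules can match the input "ab" with the same length (2 characters),
// so the rule defined first, KEYWORD, wins the tie.
KEYWORD : 'ab' ;

// For the input "abc" this rule matches 3 characters and KEYWORD only 2,
// so the longer match, WORD, wins even though it is defined later.
WORD    : [a-z]+ ;
```

In your grammar, TAG and NAME are in the same position for an input like source: both match all six characters, and TAG is defined first.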
A solution to this could be:
tag : SIMPLE_ID | TAG;
name : SIMPLE_ID | NAME;
SIMPLE_ID : [a-zA-Z_0-9]+ ;
TAG : [a-zA-Z_0-9$.*{}]+ ;
NAME : [a-zA-Z_0-9@]+ ;
That way, foobar would become a SIMPLE_ID, foo.bar a TAG and @mu a NAME.
There are more things incorrect in your grammar:
in your lexer, you're skipping NL tokens, but you're using them in parser rules as well: you can't do that (since such tokens will never be created)
ANY: .*?; can potentially match an empty string (of which there are infinitely many): lexer rules must always match at least 1 character! However, if you change .*? to .+?, it will always match just 1 character, since you made it non-greedy (the trailing ?). And you cannot use .+ because then it will match the entire input. You should do something like this:
// Use a parser rule to "glue" all single ANY tokens to each other
any : ANY+ ;
// all other lexer rules
// This must be very last rule!
ANY : . ;
If you don't define ANY as the last rule, input like X would not be tokenised as a TAG, but as an ANY token (remember my first paragraph).
the rule comment: '#' NL; makes no sense: a comment isn't a # followed by a line break. I'd expect a lexer rule for such a thing:
COMMENT : '#' ~[\r\n]* -> skip;
And there's no need to include a line break in this rule: line breaks are already handled by NL.
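Putting these points together, a combined grammar might look roughly like the sketch below. It has not been tested against real-world fluentd files, and it makes a few assumptions beyond the fixes listed above: NL is kept as a real token so that param can end on it, value glues together every token up to the line break (so whitespace inside a value is lost), the closing tag reuses name instead of close_, and values containing <, > or # are not handled.

```antlr
grammar Fluentd;

// Parser rules
content : (entry | NL)* EOF ;
entry   : '<' name tag? '>' (entry | param | NL)* '<' '/' name '>' ;
param   : name value NL ;
// Glue every token up to the line break into one value.
// '/' is listed explicitly because the literal '/' used in entry
// becomes its own token and is therefore never an ANY.
value   : (SIMPLE_ID | TAG | NAME | '/' | ANY)+ ;
tag     : SIMPLE_ID | TAG ;
name    : SIMPLE_ID | NAME ;

// Lexer rules
COMMENT   : '#' ~[\r\n]* -> skip ;   // the line break itself is left for NL
NL        : '\r'? '\n' ;             // not skipped: param needs it
WS        : [ \t]+ -> skip ;
SIMPLE_ID : [a-zA-Z_0-9]+ ;
TAG       : [a-zA-Z_0-9$.*{}]+ ;
NAME      : [a-zA-Z_0-9@]+ ;
ANY       : . ;                      // must stay the very last rule
```

With these rules, on the sample input from the question, @type should lex as a NAME, forward and 24224 as SIMPLE_ID tokens, and /path as the literal '/' followed by a SIMPLE_ID.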
Upvotes: 1