julius

Reputation: 875

Simple grammar for fluentd?

I am new to ANTLR4 and I am trying to create a grammar that parses fluentd config files into a tree. Can you point out what I am doing wrong here?

The fluentd syntax looks a lot like Apache's (pseudo-xml, shell-style comments, kv-pairs in a tag), for example:

# Receive events from 24224/tcp
<source>
  @type forward
  port 24224
</source>

# example
<match>
    # file or memory
    buffer_type    file
    <copy>
      file /path
    </copy>
</match>

This is my grammar so far:

grammar Fluentd;

// root element
content: (entry | comment)*;

entry: '<' name tag? '>' (entry | comment | param)* '<' '/' close_ '>';

name: NAME;

close_: NAME;

tag: TAG;

comment: '#' NL;

param: name value NL;

value: ANY;



ANY: .*?;

NL: ('\r'?'\n'|'\n') -> skip;

TAG: ('a'..'z' | 'A'..'Z' | '_' | '0'..'9'| '$' |'.' | '*' | '{' | '}')+;

NAME: ('a'..'z'| 'A..Z' | '@' | '_' | '0'..'9')+;

WS: (' '|'\t') -> skip;

...And it fails miserably on the above input:

line 2:2 mismatched input 'Receive' expecting NL
line 3:1 missing NAME at 'source'
line 4:8 mismatched input 'forward' expecting ANY
line 6:2 mismatched input 'source' expecting NAME
line 8:2 mismatched input 'example' expecting NL
line 9:1 missing NAME at 'match'
line 10:6 mismatched input 'file' expecting NL
line 12:2 mismatched input 'match' expecting NAME

Upvotes: 0

Views: 111

Answers (1)

Bart Kiers

Reputation: 170158

The first thing you must realise is that the lexer works independently from the parser. The lexer simply creates tokens by trying to match as many characters as possible. If two or more lexer rules match the same number of characters, the rule defined first will "win".

Having said that, the input source can therefore never be tokenised as a NAME, since the TAG rule also matches it and is defined before NAME.

A solution to this could be:

tag  : SIMPLE_ID | TAG;
name : SIMPLE_ID | NAME;

SIMPLE_ID : [a-zA-Z_0-9]+ ;
TAG       : [a-zA-Z_0-9$.*{}]+ ;
NAME      : [a-zA-Z_0-9@]+ ;

That way, foobar would become a SIMPLE_ID, foo.bar a TAG and @mu a NAME.
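The tie-breaking described above can be checked with a small simulation. This is only a sketch in Python, not ANTLR itself: the helper first_token is hypothetical, and the regular expressions are assumed to be equivalent to the character sets of the lexer rules (with the intended 'A'..'Z' range in NAME).

```python
import re

def first_token(rules, text):
    """Return (rule_name, lexeme) for the token at the start of text.

    Mimics the two lexer principles: the longest match wins, and on a
    tie the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best, so an
        # earlier rule keeps its place when lengths are equal.
        if m and m.end() > 0 and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best

# TAG and NAME in the order used by the question's grammar.
ORIGINAL = [
    ("TAG", r"[a-zA-Z_0-9$.*{}]+"),
    ("NAME", r"[a-zA-Z_0-9@]+"),
]

# The revised rules, with SIMPLE_ID defined first.
REVISED = [
    ("SIMPLE_ID", r"[a-zA-Z_0-9]+"),
    ("TAG", r"[a-zA-Z_0-9$.*{}]+"),
    ("NAME", r"[a-zA-Z_0-9@]+"),
]

if __name__ == "__main__":
    print(first_token(ORIGINAL, "source"))
    for word in ("foobar", "foo.bar", "@mu"):
        print(first_token(REVISED, word))
```

With the original order, source ties between TAG and NAME and goes to TAG. With the revised rules, the three sample inputs end up in three different token types, which is what lets the parser rules tag and name accept them.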

There are more things incorrect in your grammar:

  • in your lexer, you're skipping NL tokens, but you're using them in parser rules as well: you can't do that (since such tokens will never be created)

  • ANY: .*?; can match an empty string (and an input contains infinitely many of those): lexer rules must always match at least 1 character! However, if you change .*? to .+?, it will always match just 1 character, because the trailing ? makes it non-greedy. And you cannot use .+ because then it will match the entire input. You should do something like this:

    // Use a parser rule to "glue" all single ANY tokens to each other
    any : ANY+ ;
    
    // all other lexer rules
    
    // This must be very last rule!
    ANY : . ;
    

    If you don't define ANY as the last rule, input like X would not be tokenised as a TAG, but as an ANY token (remember my first paragraph).

  • the rule comment: '#' NL; makes no sense: a comment isn't a # followed by a line break. I'd expect a lexer rule for such a thing:

    COMMENT : '#' ~[\r\n]* -> skip;
    

    And there's no need to include a line break in this rule: these are already handled in NL.
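The quantifier behaviour from the ANY bullet, and the shape of the COMMENT rule, can be illustrated with Python's re module. This is an analogy under the assumption that greedy and non-greedy quantifiers behave the same way in this simple case; it is not ANTLR's own matching engine. re.DOTALL is used because ANTLR's . also matches line breaks.

```python
import re

text = "forward\nport 24224"

# `.*?` is satisfied by zero characters, so it yields an empty match.
lazy_star = re.match(r".*?", text, re.DOTALL).group()

# `.+?` needs one character and then stops as early as it can.
lazy_plus = re.match(r".+?", text, re.DOTALL).group()

# Greedy `.+` with a dot that also matches newlines swallows everything.
greedy_plus = re.match(r".+", text, re.DOTALL).group()

# Regex analogue of  COMMENT : '#' ~[\r\n]* ;  it stops before the line break.
comment = re.match(r"#[^\r\n]*", "# file or memory\nbuffer_type file").group()

if __name__ == "__main__":
    for label, value in [
        ("lazy_star", lazy_star),
        ("lazy_plus", lazy_plus),
        ("greedy_plus", greedy_plus),
        ("comment", comment),
    ]:
        print(label, repr(value))
```

The empty match and the single-character match are why ANY is better written as a single . and glued together in a parser rule, and the comment pattern shows that the line break is left for NL to handle.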

Upvotes: 1
