Preeti

Reputation: 93

How to find the length of a token in antlr?

I am trying to create a grammar which accepts any character or number or just about anything, provided its length is equal to 1.

Is there a function to check the length?

EDIT

Let me make my question clearer with an example. I wrote the following grammar:

grammar first;

tokens {
    SET =   'set';
    VAL =   'val';
    UND =   'und';
    CON =   'con';
    ON  =   'on';
    OFF =   'off';
}

@parser::members {
  private boolean inbounds(Token t, int min, int max) {
    int n = Integer.parseInt(t.getText());
    return n >= min && n <= max;
  }
}

parse   :   SET expr;

expr    :   VAL('u'('e')?)? String |
        UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) |
        CON('n'('e'('c'('t')?)?)?)? oneChar
    ;

CHAR    :   'a'..'z';

DIGIT   :   '0'..'9';

String  :   (CHAR | DIGIT)+;

dot :   .;

oneChar :   dot { $dot.text.length() == 1;} ;

Space  : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};

I want my grammar to do the following things:

  1. Accept commands like 'set value abc', 'set underli on', and 'set conn #'. The grammar should be smart enough to accept abbreviated keywords such as 'underl' instead of 'underline'.
  2. The third syntax, 'set connect oneChar', should accept exactly one character, which may be a digit, a letter, or any special character. This part causes a compiler error in the generated parser file.
  3. The first syntax, 'set value', should accept any string, including 'on' and 'off'. But input such as 'set value offer' fails. I think this happens because I already have a token OFF.

None of the three requirements listed above works with my grammar, and I don't know why.

Upvotes: 2

Views: 2542

Answers (1)

Bart Kiers

Reputation: 170207

There are some mistakes and/or bad practices in your grammar:


#1

The following is not a validating predicate:

{$dot.text.length() == 1;}

A proper validating predicate in ANTLR has a question mark after the closing brace, and the code inside has no semicolon at the end. So it should be:

{$dot.text.length() == 1}?

instead.
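To make the effect of such a predicate concrete, here is a rough plain-Java sketch of what it amounts to at parse time: the boolean expression is evaluated, and the rule fails if it is false. The names PredicateSketch, FailedPredicate and oneChar are invented for illustration. This is not ANTLR's generated code or its API.

```java
// Illustrative only: mimics the effect of {$dot.text.length() == 1}? without using ANTLR.
final class PredicateSketch {

    // Hypothetical stand-in for the error a parser raises when a predicate is false.
    static final class FailedPredicate extends RuntimeException {
        FailedPredicate(String rule, String predicate) {
            super("rule " + rule + " failed predicate: {" + predicate + "}?");
        }
    }

    // Returns the token text if the predicate holds; otherwise the rule fails.
    static String oneChar(String tokenText) {
        if (!(tokenText.length() == 1)) {
            throw new FailedPredicate("oneChar", "$dot.text.length() == 1");
        }
        return tokenText;
    }

    public static void main(String[] args) {
        System.out.println(oneChar("x"));           // predicate holds
        try {
            oneChar("xy");                          // predicate is false
        } catch (FailedPredicate e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Note that the predicate body is a plain boolean expression, which is why a trailing semicolon inside the braces does not belong there.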


#2

You should not be handling these alternative commands:

expr
  :  VAL('u'('e')?)? String 
  |  UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) 
  |  CON('n'('e'('c'('t')?)?)?)? oneChar
  ;

in a parser rule. You should let the lexer handle this instead. Something like this will do it:

expr
  :  VAL String
  |  UND (ON | OFF)
  |  CON oneChar
  ;

// ...

VAL : 'val' ('u' ('e')?)?;
UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;

(also see #5!)
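The nested optionals in these lexer rules accept exactly those prefixes of the keyword that are at least three characters long. The same check can be sketched in plain Java. The helper isAbbreviation is hypothetical and only illustrates which inputs the rules accept.

```java
// Illustrative only: the set of words accepted by a rule such as
// UND : 'und' ('e' ('r' ('l' ('i' ('n' ('e')?)?)?)?)?)?;
final class AbbreviationSketch {

    // True if word is a prefix of keyword with at least minLength characters.
    static boolean isAbbreviation(String word, String keyword, int minLength) {
        return word.length() >= minLength && keyword.startsWith(word);
    }

    public static void main(String[] args) {
        System.out.println(isAbbreviation("underli", "underline", 3)); // true
        System.out.println(isAbbreviation("conn", "connect", 3));      // true
        System.out.println(isAbbreviation("va", "value", 3));          // false: too short
        System.out.println(isAbbreviation("valx", "value", 3));        // false: not a prefix
    }
}
```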


#3

Your lexer rules:

CHAR    :   'a'..'z';
DIGIT   :   '0'..'9';  
String  :   (CHAR | DIGIT)+;

are making things complicated for you. Because of them, the lexer can produce three different kinds of tokens: CHAR, DIGIT or String. Ideally, you should only create String tokens, since a String can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:

fragment CHAR  : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;

There will now be no CHAR and DIGIT tokens in your token stream, only String tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never be tokens of their own (and can therefore never appear in any parser rule!).
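One way to picture fragment rules is as private helper methods: they help another rule recognise characters but never emit a token themselves. The plain-Java sketch below follows that analogy. The class and method names are invented, and as a simplification it silently skips every character that is neither a letter nor a digit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: isChar/isDigit play the role of the fragment rules,
// stringTokens plays the role of String : (CHAR | DIGIT)+;
final class FragmentSketch {

    // "fragment CHAR": a helper, never a token of its own.
    private static boolean isChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // "fragment DIGIT": a helper, never a token of its own.
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // The only token-producing rule: every run of letters/digits becomes one String token.
    static List<String> stringTokens(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (isChar(c) || isDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A single letter and a single digit each come out as a String token.
        System.out.println(stringTokens("a 7 abc42")); // [a, 7, abc42]
    }
}
```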


#4

The rule:

dot :   .;

does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character but in parser rules, it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.

The input source is first tokenized based on the lexer rules. After that has been done, the parser (through its parser rules) can then operate on these tokens (not characters!!!). Make sure you understand this! (if not, ask for clarification or grab a book about ANTLR)

- an example -

Take the following grammar:

p : . ;
A : 'a' | 'A';
B : 'b' | 'B';

The parser rule p will now match any token that the lexer produces, which can only be an A or a B token. So, p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.

And in the following grammar:

prs : . ;
FOO : 'a';
BAR : . ;

the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a' since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, which is either FOO or BAR.
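The rule-order effect in this second grammar can be mimicked with a few lines of plain Java. This is only a sketch of the behaviour described above, with invented names. It is not how ANTLR implements its lexer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the token types produced for
//   FOO : 'a';
//   BAR : . ;
final class TokenPrioritySketch {

    // FOO is checked first, so 'a' never reaches BAR; every other character is a BAR.
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            types.add(c == 'a' ? "FOO" : "BAR");
        }
        return types;
    }

    public static void main(String[] args) {
        // The "parser" only ever sees this list of token types, never raw characters.
        System.out.println(tokenize("ab#")); // [FOO, BAR, BAR]
    }
}
```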


#5

Putting single characters like 'u' inside your parser rules will cause the lexer to tokenize a u as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token has precedence over other tokens. You should keep all such literals outside your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.

So, don't do:

pRule  : 'u' ':' String;
String : ...

but do:

pRule  : U ':' String;
U      : 'u';
String : ...

You could also make ':' a lexer rule, but that matters less. The 'u', however, could also be matched as a String, so it must be defined as a lexer rule before the String rule.


Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:

grammar first;

parse
  :  (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
  ;

expr
  :  VAL String    {System.out.print("A :: ");}
  |  UL (ON | OFF) {System.out.print("B :: ");}
  |  CON oneChar   {System.out.print("C :: ");}
  ;

oneChar 
  :  String {$String.text.length() == 1}?
  ;

SET : 'set';
VAL : 'val' ('u' ('e')?)?;
UL  : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
ON  : 'on';
OFF : 'off';

String : (CHAR | DIGIT)+;

fragment CHAR  : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';

Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};

that can be tested with the following class:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = 
                "set value abc  \n" + 
                "set underli on \n" + 
                "set conn x     \n" + 
                "set conn xy      ";
        ANTLRStringStream in = new ANTLRStringStream(source);
        firstLexer lexer = new firstLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        firstParser parser = new firstParser(tokens);
        System.out.println("parsing:\n======\n" + source + "\n======");
        parser.parse();
    }
}

which, after generating the lexer and parser:

java -cp antlr-3.2.jar org.antlr.Tool first.g 
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main

prints the following output:

parsing:
======
set value abc  
set underli on 
set conn x     
set conn xy      
======
A :: expr = value abc
B :: expr = underli on
C :: expr = conn x
line 0:-1 rule oneChar failed predicate: {$String.text.length() == 1}?
C :: expr = conn xy

As you can see, the last command, C :: expr = conn xy, produces an error, as expected.

Upvotes: 8
