daniel0mullins

Reputation: 1959

ANTLR parsing is not finding correct lexer parts

I am a complete newcomer to ANTLR.

I have the following ANTLR grammar:

grammar DrugEntityRecognition;

// Parser Rules 

derSentence : ACTION (INT | FRACTION | RANGE) FORM TEXT;

// Lexer Rules

ACTION : 'TAKE' | 'INFUSE' | 'INJECT' | 'INHALE' | 'APPLY' | 'SPRAY' ;

INT : [0-9]+ ;

FRACTION : [1] '/' [1-9] ;

RANGE : INT '-' INT ;

FORM : ('TABLET' | 'TABLETS' | 'CAPSULE' | 'CAPSULES' | 'SYRINGE') ;

TEXT : ('A'..'Z' | WHITESPACE | ',')+ ;

WHITESPACE : ('\t' | ' ' | '\r' | '\n' | '\u000C')+ -> skip ;

And when I try to parse a sentence as follows:

String upperLine = line.toUpperCase();
org.antlr.v4.runtime.CharStream stream = new ANTLRInputStream(upperLine);
DrugEntityRecognitionLexer lexer = new DrugEntityRecognitionLexer(stream);
lexer.removeErrorListeners();
lexer.addErrorListener(ThrowingErrorListener.INSTANCE);

CommonTokenStream tokenStream = new CommonTokenStream(lexer);
DrugEntityRecognitionParser parser = new DrugEntityRecognitionParser(tokenStream);

try {
    DrugEntityRecognitionParser.DerSentenceContext ctx = parser.derSentence();

    StringBuilder sb = new StringBuilder();

    sb.append("ACTION: ").append(ctx.ACTION());
    sb.append(", ");
    sb.append("FORM: ").append(ctx.FORM());
    sb.append(", ");
    sb.append("INT: ").append(ctx.INT());
    sb.append(", ");
    sb.append("FRACTION: ").append(ctx.FRACTION());
    sb.append(", ");
    sb.append("RANGE: ").append(ctx.RANGE());

    System.out.println(upperLine);
    System.out.println(sb.toString());

} catch (ParseCancellationException e) {
    //e.printStackTrace();
}

An example of the input to the lexer:

take 10 Tablet (25MG)  by oral route  every week

In this case the ACTION node is not populated: take is recognized only as a TEXT node, not as an ACTION node. 10 is recognized as an INT node, however.

How can I modify this grammar so that the ACTION node is populated correctly (as well as FORM, which is not populated either)?

Upvotes: 0

Views: 285

Answers (1)

Mike Lischke

Reputation: 53307

There are several problems in your grammar:

  1. Your TEXT rule only matches uppercase letters. Same for ACTION.
  2. You shouldn't mix punctuation (here the comma) and text in a single text rule, otherwise you cannot freely allow whitespace between tokens.
  3. You don't match parentheses at all, hence (25MG) is not valid input and the parser returns in an error state.
  4. You did not check for syntax errors, so you could not learn what went wrong during recognition.

Also, when in doubt, always print your token sequence from the token source to see if the input has actually been tokenized as you expect. Start there to fix your grammar before you go to the parser.
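As an illustration of what such a token dump tends to reveal here: an ANTLR lexer picks the rule that matches the longest stretch of input and breaks ties by rule order. The following is a rough sketch in plain Java, not ANTLR itself. The class name LongestMatchSketch and the regular expressions are assumptions that only approximate the lexer rules from the question. It suggests why TAKE followed by a space ends up in TEXT (five characters) rather than in ACTION (four), and why TABLET never becomes a FORM token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated lexer: it imitates the
// "longest match wins, first rule wins on ties" behaviour with regexes.
class LongestMatchSketch {
    // Rules in grammar order; the regexes approximate the question's lexer rules.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ACTION", Pattern.compile("TAKE|INFUSE|INJECT|INHALE|APPLY|SPRAY"));
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("FRACTION", Pattern.compile("1/[1-9]"));
        RULES.put("RANGE", Pattern.compile("[0-9]+-[0-9]+"));
        // Longer alternatives first, because Java regex alternation is ordered.
        RULES.put("FORM", Pattern.compile("TABLETS|TABLET|CAPSULES|CAPSULE|SYRINGE"));
        // TEXT includes whitespace and the comma, as in the question.
        RULES.put("TEXT", Pattern.compile("[A-Z \\t\\r\\n\\f,]+"));
        RULES.put("WHITESPACE", Pattern.compile("[ \\t\\r\\n\\f]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) {
                // No rule matches this character, e.g. '(' or ')'.
                tokens.add("ERROR('" + input.charAt(pos) + "')");
                pos++;
            } else {
                if (!bestRule.equals("WHITESPACE")) { // mimic "-> skip"
                    tokens.add(bestRule + "('" + input.substring(pos, pos + bestLen) + "')");
                }
                pos += bestLen;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "take 10 Tablet (25MG)  by oral route  every week".toUpperCase();
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

With the real generated lexer, the equivalent check is to call fill() on the CommonTokenStream and print the result of getTokens() before invoking the parser.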

About case sensitivity: typically (if your language is case-insensitive) you have rules like these:

fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
...

to match a letter in either case, and then define your keywords like this:

ACTION : T A K E | I N F U S E | I N J E C T | I N H A L E | A P P L Y | S P R A Y ;
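Writing these rules by hand gets tedious as the keyword list grows. Below is a small sketch of a helper that prints the fragment rules and the spelled-out keyword rule from a plain list of words. The class CaseInsensitiveRuleSketch and its methods are hypothetical, not part of ANTLR, and the sketch assumes the keywords consist only of the letters A to Z.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper that generates grammar text; it does not use ANTLR.
class CaseInsensitiveRuleSketch {
    // "take" -> "T A K E": one fragment reference per letter.
    static String spell(String keyword) {
        return keyword.toUpperCase().chars()
                .mapToObj(c -> String.valueOf((char) c))
                .collect(Collectors.joining(" "));
    }

    // Builds e.g. "ACTION : T A K E | A P P L Y ;".
    static String keywordRule(String ruleName, List<String> keywords) {
        String alternatives = keywords.stream()
                .map(CaseInsensitiveRuleSketch::spell)
                .collect(Collectors.joining(" | "));
        return ruleName + " : " + alternatives + " ;";
    }

    // One fragment per distinct letter, sorted, e.g. "fragment A : [aA] ;".
    static List<String> fragments(List<String> keywords) {
        TreeSet<Character> letters = new TreeSet<>();
        for (String keyword : keywords) {
            for (char c : keyword.toUpperCase().toCharArray()) {
                letters.add(c);
            }
        }
        return letters.stream()
                .map(c -> "fragment " + c + " : [" + Character.toLowerCase(c) + c + "] ;")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> actions = List.of("TAKE", "INFUSE", "INJECT", "INHALE", "APPLY", "SPRAY");
        fragments(actions).forEach(System.out::println);
        System.out.println(keywordRule("ACTION", actions));
    }
}
```

The output can be pasted into the grammar; only the letters that actually occur in the keywords get a fragment.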

Upvotes: 1
