Reputation: 1959
I am a complete newcomer to ANTLR.
I have the following ANTLR grammar:
grammar DrugEntityRecognition;
// Parser Rules
derSentence : ACTION (INT | FRACTION | RANGE) FORM TEXT;
// Lexer Rules
ACTION : 'TAKE' | 'INFUSE' | 'INJECT' | 'INHALE' | 'APPLY' | 'SPRAY' ;
INT : [0-9]+ ;
FRACTION : [1] '/' [1-9] ;
RANGE : INT '-' INT ;
FORM : ('TABLET' | 'TABLETS' | 'CAPSULE' | 'CAPSULES' | 'SYRINGE') ;
TEXT : ('A'..'Z' | WHITESPACE | ',')+ ;
WHITESPACE : ('\t' | ' ' | '\r' | '\n' | '\u000C')+ -> skip ;
And when I try to parse a sentence as follows:
String upperLine = line.toUpperCase();
org.antlr.v4.runtime.CharStream stream = new ANTLRInputStream(upperLine);
DrugEntityRecognitionLexer lexer = new DrugEntityRecognitionLexer(stream);
lexer.removeErrorListeners();
lexer.addErrorListener(ThrowingErrorListener.INSTANCE);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
DrugEntityRecognitionParser parser = new DrugEntityRecognitionParser(tokenStream);
try {
DrugEntityRecognitionParser.DerSentenceContext ctx = parser.derSentence();
StringBuilder sb = new StringBuilder();
sb.append("ACTION: ").append(ctx.ACTION());
sb.append(", ");
sb.append("FORM: ").append(ctx.FORM());
sb.append(", ");
sb.append("INT: ").append(ctx.INT());
sb.append(", ");
sb.append("FRACTION: ").append(ctx.FRACTION());
sb.append(", ");
sb.append("RANGE: ").append(ctx.RANGE());
System.out.println(upperLine);
System.out.println(sb.toString());
} catch (ParseCancellationException e) {
//e.printStackTrace();
}
An example of the input to the lexer:
take 10 Tablet (25MG) by oral route every week
In this case the ACTION node is not populated: take is recognized only as a TEXT node, not as an ACTION node. 10 is recognized as an INT node, however.
How can I modify this grammar so that the ACTION node is populated correctly (as well as FORM, which is not being populated either)?
Upvotes: 0
Views: 285
Reputation: 53307
There are several problems in your grammar. For one, (25MG) is not valid input: no lexer rule matches the parentheses, so the parser ends up in an error state.
Also, when in doubt, always print the token sequence from the token source to see whether the input has actually been tokenized as you expect. Start there to fix your grammar before you go to the parser.
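To see what such a token dump would likely show for the grammar in the question, here is a minimal stand-alone sketch. It does not use the ANTLR runtime; it only imitates the lexer's selection rule (the rule matching the most characters wins, and on a tie the rule defined first wins) with java.util.regex. The class name LongestMatchDemo, the tokenize helper and the regex translations of the lexer rules are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone imitation of lexer rule selection; NOT the ANTLR runtime.
final class LongestMatchDemo {

    // Lexer rules from the question, kept in grammar order (order breaks ties).
    // Java alternation is leftmost-first, so longer alternatives are listed first.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ACTION", Pattern.compile("TAKE|INFUSE|INJECT|INHALE|APPLY|SPRAY"));
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("FRACTION", Pattern.compile("1/[1-9]"));
        RULES.put("RANGE", Pattern.compile("[0-9]+-[0-9]+"));
        RULES.put("FORM", Pattern.compile("TABLETS|TABLET|CAPSULES|CAPSULE|SYRINGE"));
        RULES.put("TEXT", Pattern.compile("[A-Z \\t\\r\\n\\f,]+"));
        RULES.put("WHITESPACE", Pattern.compile("[ \\t\\r\\n\\f]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                // No rule matches this character (comparable to a token recognition error).
                tokens.add("ERROR('" + input.charAt(pos) + "')");
                pos++;
                continue;
            }
            if (!bestRule.equals("WHITESPACE")) { // imitates "-> skip"
                tokens.add(bestRule + "('" + input.substring(pos, bestEnd) + "')");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("TAKE 10 TABLET (25MG) BY ORAL ROUTE EVERY WEEK")) {
            System.out.println(token);
        }
    }
}
```

For the sample sentence, the first three tokens come out as TEXT('TAKE '), INT('10') and TEXT(' TABLET '), followed by an error for the opening parenthesis. Because TEXT also accepts whitespace, it matches one character more than ACTION or FORM and therefore wins. With the real runtime, the equivalent check is to call fill() on the CommonTokenStream and print each Token returned by getTokens() before invoking the parser.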
About case sensitivity: typically (if your language is case-insensitive) you have rules like these:
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
...
to match a letter in either case and then define your keywords so:
ACTION : T A K E | I N F U S E | I N J E C T | I N H A L E | A P P L Y | S P R A Y;
Upvotes: 1