Reputation: 341
I'm trying to match measurements in English input text, using Antlr 3.2 and Java 1.6. I've got lexical rules like the following:
fragment
MILLIMETRE
: 'millimetre' | 'millimetres'
| 'millimeter' | 'millimeters'
| 'mm'
;
MEASUREMENT
: MILLIMETRE | CENTIMETRE | ... ;
I'd like to be able to accept any combination of upper- and lowercase input and - more importantly - just return a single lexical token for all the variants of MILLIMETRE. But at the moment, my AST contains 'millimetre', 'millimeters', 'mm' etc. just as in the input text.
After reading http://www.antlr.org/wiki/pages/viewpage.action?pageId=1802308, I think I need to do something like the following:
tokens {
T_MILLIMETRE;
}
fragment
MILLIMETRE
: ('millimetre' | 'millimetres'
| 'millimeter' | 'millimeters'
| 'mm') { $type = T_MILLIMETRE; }
;
However, when I do this, I get the following compiler error in the Java code generated by Antlr:
cannot find symbol
_type = T_MILLIMETRE;
I tried the following instead:
MEASUREMENT
: MILLIMETRE { $type = T_MILLIMETRE; }
| ...
but then MEASUREMENT is not matched anymore.
The more obvious solution with a rewrite rule:
MEASUREMENT
: MILLIMETRE -> ^(T_MILLIMETRE MILLIMETRE)
| ...
causes an NPE:
java.lang.NullPointerException at org.antlr.grammar.v2.DefineGrammarItemsWalker.alternative(DefineGrammarItemsWalker.java:1555).
Making MEASUREMENT into a parser rule gives me the dreaded "The following token definitions can never be matched because prior tokens match the same input" error.
By creating a parser rule
measurement : T_MILLIMETRE | ...
I get the warning "no lexer rule corresponding to token: T_MILLIMETRE". Antlr still runs, but the AST still contains the input text and not T_MILLIMETRE.
I'm obviously not yet seeing the world the way Antlr does. Can anyone give me any hints or advice please?
Steve
Upvotes: 3
Views: 1128
Reputation: 170227
Note that fragment rules only "live" inside the lexer and cease to exist in the parser. For example:
grammar Measurement;
options {
output=AST;
}
parse
: (m=MEASUREMENT {
String contents = $m.text;
boolean isMeasurementType = $m.getType() == MeasurementParser.MEASUREMENT;
System.out.println("contents="+contents+", isMeasurementType="+isMeasurementType);
})+ EOF
;
MEASUREMENT
: MILLIMETRE
;
fragment
MILLIMETRE
: 'millimetre'
| 'millimetres'
| 'millimeter'
| 'millimeters'
| 'mm'
;
SPACE
: (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
;
with input text:
"millimeters mm"
will print:
contents=millimeters, isMeasurementType=true
contents=mm, isMeasurementType=true
In other words: the type MILLIMETRE does not exist; they're all of type MEASUREMENT.
Upvotes: 0
Reputation: 170227
Here's a way to do that:
grammar Measurement;
options {
output=AST;
}
tokens {
ROOT;
MM;
CM;
}
parse
: measurement+ EOF -> ^(ROOT measurement+)
;
measurement
: Number MilliMeter -> ^(MM Number)
| Number CentiMeter -> ^(CM Number)
;
Number
: '0'..'9'+
;
MilliMeter
: 'millimetre'
| 'millimetres'
| 'millimeter'
| 'millimeters'
| 'mm'
;
CentiMeter
: 'centimetre'
| 'centimetres'
| 'centimeter'
| 'centimeters'
| 'cm'
;
Space
: (' ' | '\t' | '\r' | '\n'){$channel=HIDDEN;}
;
It can be tested with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("12 millimeters 3 mm 456 cm");
        MeasurementLexer lexer = new MeasurementLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MeasurementParser parser = new MeasurementParser(tokens);
        MeasurementParser.parse_return returnValue = parser.parse();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}
which produces the following DOT file:
digraph {
ordering=out;
ranksep=.4;
bgcolor="lightgrey"; node [shape=box, fixedsize=false, fontsize=12, fontname="Helvetica-bold", fontcolor="blue"
width=.25, height=.25, color="black", fillcolor="white", style="filled, solid, bold"];
edge [arrowsize=.5, color="black", style="bold"]
n0 [label="ROOT"];
n1 [label="MM"];
n1 [label="MM"];
n2 [label="12"];
n3 [label="MM"];
n3 [label="MM"];
n4 [label="3"];
n5 [label="CM"];
n5 [label="CM"];
n6 [label="456"];
n0 -> n1 // "ROOT" -> "MM"
n1 -> n2 // "MM" -> "12"
n0 -> n3 // "ROOT" -> "MM"
n3 -> n4 // "MM" -> "3"
n0 -> n5 // "ROOT" -> "CM"
n5 -> n6 // "CM" -> "456"
}
which corresponds to the tree (ROOT (MM 12) (MM 3) (CM 456)).
(The original image of this tree was created by http://graph.gafol.net/)
EDIT
Note that the following:
measurement
: Number m=MilliMeter {System.out.println($m.getType() == MeasurementParser.MilliMeter);}
| Number CentiMeter
;
will always print true, regardless of whether the "contents" of the (millimeter) tokens are mm, millimetre, millimetres, ...
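The grammar above only matches lowercase spellings, and the upper-/lowercase part of the question is not covered by it. As a rough illustration of the same "many spellings, one type" idea outside the grammar, here is a minimal plain-Java sketch. UnitNormalizer and its canonical method are hypothetical names invented for this example and are not part of ANTLR; the sketch assumes you have the raw word (or the token text) as a String and want the canonical unit name, ignoring letter case. It does not by itself make the generated lexer accept mixed-case input.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of ANTLR): maps every known spelling of a
// unit to one canonical name, ignoring upper-/lowercase differences.
final class UnitNormalizer {
    private static final Map<String, String> CANONICAL = new HashMap<String, String>();

    static {
        // Same spelling variants as the MilliMeter and CentiMeter lexer rules.
        for (String s : new String[] {"millimetre", "millimetres", "millimeter", "millimeters", "mm"}) {
            CANONICAL.put(s, "MM");
        }
        for (String s : new String[] {"centimetre", "centimetres", "centimeter", "centimeters", "cm"}) {
            CANONICAL.put(s, "CM");
        }
    }

    private UnitNormalizer() {
        // Static helper only; no instances.
    }

    /** Returns "MM" or "CM" for a known spelling in any letter case, or null if the word is unknown. */
    static String canonical(String word) {
        if (word == null) {
            return null;
        }
        // Locale.ROOT keeps the lowercasing independent of the default locale.
        return CANONICAL.get(word.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Millimetres", "MM", "centiMeter", "inch"}) {
            System.out.println(s + " -> " + canonical(s));
        }
    }
}
```

The lookup table mirrors the alternatives in the lexer rules, so adding a unit means adding one more loop here, just as it means adding one more lexer rule and one more imaginary token in the grammar.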
Upvotes: 1