Reputation: 45
I am writing an ANTLR grammar to recognize HTML block-level elements within plain text. Here is a relevant snippet, limited to the div tag:
grammar Test;
blockElement
: div
;
div
: '<' D I V HTML_ATTRIBUTES? '>' (blockElement | TEXT)* '</' D I V '>'
;
D : ('d' | 'D') ;
I : ('i' | 'I') ;
V : ('v' | 'V') ;
HTML_ATTRIBUTES
: WS (~( '<' | '\n' | '\r' | '"' | '>' ))+
;
TEXT
: (. | '\r' | '\n')
;
fragment WS
: (' ' | '\t')
;
The TEXT token is supposed to represent anything that is not a block-level element, such as plain text or inline tags (e.g. <b></b>). When I test it on nested block elements, like:
<div level_0><div level_1></div></div>
it parses them correctly. However, as soon as I add some random text, it throws a MismatchedTokenException(0!=0) right after consuming the first TEXT token, for example the capital T in:
<div level_0>This is some random text</div>
Any suggestions? Am I doing something conceptually wrong? I am using ANTLR v. 3.2 and doing the testing with ANTLRWorks v. 1.4.
Thank you
Upvotes: 1
Views: 426
Reputation: 170257
I recommend not testing your grammar with ANTLRWorks: error messages are easily missed in its console, and it might therefore not interpret your test input the way you expect. Test it with a small custom class like this instead:
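import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        // feed the test input to the generated lexer and parser
        ANTLRStringStream in = new ANTLRStringStream("<div level_0>This is some random text</div>");
        TestLexer lexer = new TestLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        // invoke the entry rule; this assumes the grammar has a rule named parse
        // (for the snippet as posted, parser.div() would be the equivalent call)
        parser.parse();
    }
}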
Now, the following rule is not correct:
TEXT
: (. | '\r' | '\n')
;
The . already matches both \r and \n, so it should be:
TEXT
: .
;
After changing that, you can generate the parser and lexer, compile all .java files and run the Main class:
java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
which will produce the following error:
line 1:15 mismatched input 'i' expecting '</'
because the i from This is being tokenized by the lexer rule I : ('i' | 'I') ; instead of TEXT.
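To make the rule-priority problem concrete, here is a minimal sketch in plain Java. It is a deliberately simplified model, not ANTLR's generated lexer: it looks at one character at a time, tries the rules D, I, V and TEXT in grammar order, and ignores multi-character tokens such as '</' and HTML_ATTRIBUTES. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LexerPrioritySketch {

    // Hypothetical helper: names the rule that claims a single character,
    // trying the rules in the order they appear in the grammar.
    static String ruleFor(char c) {
        switch (Character.toLowerCase(c)) {
            case 'd': return "D";
            case 'i': return "I";
            case 'v': return "V";
            default:  return "TEXT"; // the catch-all rule only gets what is left
        }
    }

    // Maps every character of the input to the name of the rule that matches it.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            tokens.add(ruleFor(c));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This")); // prints [TEXT, TEXT, I, TEXT]
    }
}
```

In this model the parser never receives four TEXT tokens for "This"; the third one is an I token, which the div rule does not accept at that position.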
There are more problems with your current approach:
HTML_ATTRIBUTES does too much: you should instead have ATTRIBUTE, = and VALUE rules and then move the plural (html attributes) to your parser;
HTML_ATTRIBUTES excludes < and >, which is incorrect (attribute values can contain them, although it is not recommended).
I'd start over if I were you. If you want, I'm willing to propose a start: just say so.
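As a rough illustration of the ATTRIBUTE, = and VALUE split mentioned above, the following plain-Java sketch breaks an attribute string into those three kinds of tokens. It is not an ANTLR grammar and not a proposed solution: the class name, the regular expression and the restriction to double-quoted values are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeSketch {

    // An attribute name, optionally followed by = and a double-quoted value.
    // The value may contain < and >; only the closing quote ends it.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_][A-Za-z0-9_-]*)(?:\\s*=\\s*\"([^\"]*)\")?");

    // Splits the attribute part of a tag into ATTRIBUTE, = and VALUE tokens.
    static List<String> tokenize(String attributes) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ATTRIBUTE.matcher(attributes);
        while (m.find()) {
            tokens.add("ATTRIBUTE(" + m.group(1) + ")");
            if (m.group(2) != null) { // only present when a quoted value follows
                tokens.add("=");
                tokens.add("VALUE(" + m.group(2) + ")");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ATTRIBUTE(level_0), ATTRIBUTE(title), =, VALUE(a < b)]
        System.out.println(tokenize(" level_0 title=\"a < b\""));
    }
}
```

Keeping the three pieces as separate tokens lets the parser decide how many attributes a tag has, instead of the lexer swallowing everything up to the closing > as one token.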
Upvotes: 3