MismatchedTokenException in HTML subset grammar

Question

I am writing an ANTLR grammar to recognize HTML block-level elements within plain text. Here is a relevant snippet, limited to the div tag:

grammar Test;

blockElement
  : div
  ;

div
  : '<' D I V HTML_ATTRIBUTES? '>' (blockElement | TEXT)* '</' D I V '>'
  ;

D : ('d' | 'D') ;
I : ('i' | 'I') ;
V : ('v' | 'V') ;

HTML_ATTRIBUTES
  : WS (~( '<' | '\r' | '\n' | '"' | '>' ))+
  ;

TEXT
  : (. | '\r' | '\n')
  ;

fragment WS
  : (' ' | '\t')
  ;

The TEXT token is supposed to represent anything that is not a block-level element, such as plain text or inline tags (e.g. <b>). When I test it on nested block elements, like:

it parses them correctly. However, as soon as I add some random text, it throws a MismatchedTokenException(0!=0) right after consuming the first TEXT token, e.g. the capital T in:

This is some random text

Any suggestions? Am I doing something conceptually wrong? I am using ANTLR v3.2 and testing with ANTLRWorks v1.4.

Thank you

Bart Kiers · Accepted Answer

I recommend not testing your grammar in ANTLRWorks: error messages are easily missed in its console, so it may interpret your test input differently from what you expect. Use a small custom class like this instead:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("This is some random text");
        TestLexer lexer = new TestLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        parser.blockElement(); // invoke the grammar's entry rule
    }
}

Now, the following rule is not correct:

TEXT
  :  (. | '\r' | '\n')
  ;

The . already matches both \r and \n, so it should be:

TEXT
  :  .
  ;

After changing that, you can generate the lexer and parser, compile all .java files and run the Main class:

java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main

which will produce the following error:

line 1:15 mismatched input 'i' expecting '</'

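because the i in This is tokenized by the rule I : ('i' | 'I') ; instead of TEXT. The lexer creates tokens without knowing what the parser expects, so every d, i or v in your plain text becomes a D, I or V token.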
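
To make the tokenization problem concrete, here is a rough Java sketch that imitates how the single-character rules compete. It is not ANTLR's generated code: the class LexerSketch and its methods are hypothetical names, and it only models the D, I, V and TEXT rules (no '<', '>' or HTML_ATTRIBUTES), assuming the specific letter rules win over the catch-all TEXT rule because they are defined first.

```java
import java.util.ArrayList;
import java.util.List;

public class LexerSketch {
    // Simplified model of the lexer rules D, I, V and TEXT:
    // a letter rule is preferred over the catch-all TEXT rule.
    static String tokenType(char c) {
        switch (Character.toLowerCase(c)) {
            case 'd': return "D";
            case 'i': return "I";
            case 'v': return "V";
            default:  return "TEXT";
        }
    }

    // Produces one token type per input character, like the single-character rules do.
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            types.add(tokenType(c));
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [TEXT, TEXT, I, TEXT]
        System.out.println(tokenize("This"));
    }
}
```

The parser rule div only accepts TEXT or another blockElement between the tags, so the I token in the middle of the word is what triggers the mismatch.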

There are more problems with your current approach:


HTML_ATTRIBUTES does too much: define separate ATTRIBUTE, = and VALUE lexer rules, and handle the repetition (multiple attributes) in the parser;
as it stands, your attribute values cannot contain < and >, which is incorrect (they can contain them, although it is not recommended).
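
As a loose illustration of the name, = and value split (outside of ANTLR), the Java sketch below pulls attributes apart with a regular expression. The class AttributeSketch and the method parseAttributes are made-up names, and the sketch assumes double-quoted values only; it is meant to show that a quoted value can safely contain < or > once name and value are recognized separately.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeSketch {
    // name, optional whitespace, '=', optional whitespace, double-quoted value.
    // The value may contain anything except a double quote, including '<' and '>'.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("\\s*([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*\"([^\"]*)\"");

    // Returns the attributes in the order they appear in the input.
    static Map<String, String> parseAttributes(String input) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(input);
        while (m.find()) {
            attributes.put(m.group(1), m.group(2));
        }
        return attributes;
    }

    public static void main(String[] args) {
        // prints {id=main, title=a > b}
        System.out.println(parseAttributes("id=\"main\" title=\"a > b\""));
    }
}
```

In the grammar, the equivalent would be three small lexer rules plus a parser rule that repeats them, rather than one lexer rule that swallows the whole attribute list.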


I'd start over if I were you. If you want, I'm willing to propose a starting point: just say so.
