matejuh

Reputation: 406

antlr html pcdata

I'm trying to write a very simple HTML parser with ANTLR, and I'm facing a problem: the ~ rule, which should match everything up to a specified character, is not working.

My lexer grammar:

lexer grammar HtmlParserLexer;

HTML: OHTML PCDATA CHTML;

PCDATA :(~'<') ; //match all until <

OHTML: '<html>';

CHTML: '</html>';

I'm trying to match:

<html>foo bar</html>

Error from Eclipse ANTLR plugin Interpreter:

MismatchedTokenException: line 1:7 mismatched input UNKNOW expecting '<'

This suggests that my grammar ignores the PCDATA rule, and I don't know why. Thanks in advance for your help.

Upvotes: 1

Views: 147

Answers (1)

Bart Kiers

Reputation: 170257

The rule PCDATA :(~'<') ; matches a single character other than '<'. You'll need to repeat it once or more: PCDATA :(~'<')+ ; (notice the +).
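To see the difference outside ANTLR, here is a rough regular-expression analogue in Python. It is only an illustration: the patterns SINGLE and REPEATED are made-up names, and [^<] stands in for ~'<' on the assumption that the two behave alike for this input.

```python
import re

# Regex analogues of the two ANTLR rules (an illustration, not ANTLR itself):
#   (~'<')   ->  [^<]    exactly one character other than '<'
#   (~'<')+  ->  [^<]+   one or more such characters
SINGLE = re.compile(r"<html>[^<]</html>")
REPEATED = re.compile(r"<html>[^<]+</html>")

text = "<html>foo bar</html>"

# The single-character form consumes only 'f' and then expects '</html>',
# so the whole input cannot match.
single_ok = SINGLE.fullmatch(text) is not None

# The repeated form consumes 'foo bar' before reaching '</html>'.
repeated_ok = REPEATED.fullmatch(text) is not None

print(single_ok, repeated_ok)
```

The single-character pattern only accepts input such as <html>f</html>, which mirrors what the original PCDATA rule allows.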

You may also want to allow <html></html> (nothing in between <html> and </html>). In that case, you shouldn't change PCDATA :(~'<')+ ; into PCDATA :(~'<')* ;, but do this instead:

HTML: OHTML PCDATA? CHTML;

PCDATA : (~'<')+ ;

because you shouldn't create lexer rules that could potentially match an empty string.
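The empty-match concern can be sketched with Python regular expressions as a stand-in for the lexer rules. This is an analogy under the assumption that [^<] mirrors ~'<'; the names STAR, PLUS and HTML below are illustrative only and do not come from the grammar.

```python
import re

STAR = re.compile(r"[^<]*")  # analogue of (~'<')*
PLUS = re.compile(r"[^<]+")  # analogue of (~'<')+

# At a position where the next character is '<', the starred form still
# "succeeds" with a zero-length match, i.e. it matches without consuming input.
empty_match = STAR.match("</html>")

# The plus form simply fails at the same position.
no_match = PLUS.match("</html>")

# Making the reference optional, as in PCDATA?, keeps the rule itself
# non-empty while still allowing an empty body.
HTML = re.compile(r"<html>(?:[^<]+)?</html>")
matches_empty_body = HTML.fullmatch("<html></html>") is not None
matches_text_body = HTML.fullmatch("<html>foo bar</html>") is not None
```

A rule that can succeed without consuming any input gives a tokenizer nothing to advance on, which is the usual reason for putting the optionality at the point of use rather than inside the rule.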

Upvotes: 3
