Ron

Reputation: 1336

ANTLR: Combination of tokens

I have a question that I have been researching for about an hour now. A given ANTLR lexer rule consists of two (or more) sub-rules, and the lexer now produces separate AST nodes for them.

Example:

[...]
variable: '$' CamelCaseIdentifier;
CamelCaseIdentifier: ('a'..'z') Identifier*;
Identifier: ('a'..'z' | 'A' .. 'Z' | '0'..'9')+;
[...]

With the given input of [...]$a[...] the result is ..., $, a, ...

I am looking for a way to tell the lexer that these rules should not be separated, so that the result is: ..., $a, ...

Could anyone help me out?

Upvotes: 1

Views: 2449

Answers (3)

john k

Reputation: 6615

I am a beginner at compilers and ANTLR, but from my limited understanding, an uppercase (lexer) rule is only for regular expressions. Lowercase (parser) rules can also double as lexer rules (see [1]). So it shouldn't matter whether variable is upper or lower case, right?

Anyways, I may be wrong, but wouldn't it be simpler to just do this:

[...]
variable: '$' ('a'..'z' | 'A' .. 'Z') ALPHANUM*;
ALPHANUM: ('a'..'z' | 'A' .. 'Z' | '0'..'9');
[...]

?

If you plan on reusing ('a'..'z' | 'A' .. 'Z'), then you should do:

[...]
variable: '$' ALPHA ALPHANUM*;
fragment ALPHA: ('a'..'z' | 'A' .. 'Z');
ALPHANUM: (ALPHA | '0'..'9');
[...]

Apologies if this is completely off base; I am still learning.

[1] https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687210/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required

Upvotes: 0

Bart Kiers

Reputation: 170158

Parser rules start with a lowercase letter and lexer rules with an uppercase letter. When you output an AST, each individual token in a parser rule becomes a separate node, so you'll want to make the variable rule a lexer rule instead of a parser rule:

Variable            : '$' CamelCaseIdentifier;
CamelCaseIdentifier : ('a'..'z') Identifier*;
Identifier          : ('a'..'z' | 'A' .. 'Z' | '0'..'9')+;

But if you do it like this, the input 123456 will be tokenized as an Identifier, which is probably not what you want. Besides, the Identifier rule would be better named AlphaNum. If you make it a fragment rule, the lexer will never produce AlphaNum tokens on its own and will only use AlphaNum inside other lexer rules (like your CamelCaseIdentifier rule). If you also want a rule that matches an Identifier, do something like this:

Variable            : '$' (CamelCaseIdentifier | Identifier);
CamelCaseIdentifier : 'a'..'z' AlphaNum*;
Identifier          : 'A'..'Z' AlphaNum*;

// a fragment rule can't be used inside parser rules, only in lexer rules
fragment AlphaNum   : 'a'..'z' | 'A' .. 'Z' | '0'..'9';
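
To make the effect of the grammar above easier to see without running ANTLR, here is a rough Python sketch that mimics it with regular expressions. It is only an approximation under a few assumptions: the `tokenize` helper, the `TOKEN_RULES` table, and the added `WS` rule are made up for this illustration, and the "longest match wins, earlier rule wins ties" loop is a simplification of how ANTLR's lexer actually chooses a rule.

```python
import re

# Stand-in for the fragment rule: it is only spliced into other
# patterns and never appears as a token type of its own.
ALPHANUM = r"[a-zA-Z0-9]"

# Hypothetical regex equivalents of the lexer rules above.
# WS is an assumption added so the sample input can contain spaces.
TOKEN_RULES = [
    ("Variable", re.compile(rf"\$(?:[a-z]{ALPHANUM}*|[A-Z]{ALPHANUM}*)")),
    ("CamelCaseIdentifier", re.compile(rf"[a-z]{ALPHANUM}*")),
    ("Identifier", re.compile(rf"[A-Z]{ALPHANUM}*")),
    ("WS", re.compile(r"\s+")),
]


def tokenize(text):
    """Split text into (rule name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        # Try every rule at the current position and keep the longest
        # match; on a tie the rule listed first wins.
        for name, pattern in TOKEN_RULES:
            match = pattern.match(text, pos)
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    print(tokenize("$a $fooBar1 Baz"))
```

In this sketch, `$a` comes out as the single pair `("Variable", "$a")` rather than as `$` followed by `a`, and the input `123456` raises a `ValueError`, because the alphanumeric pattern is only ever used inside other rules and no token rule starts with a digit.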

Upvotes: 2

eran

Reputation: 6921

Maybe try making all the rule names uppercase?

Edit: with an example:

grammar Dummy;

prog : VARIABLE*;

VARIABLE: '$' CAMELCASEIDENTIFIER;
CAMELCASEIDENTIFIER: ('a'..'z') IDENTIFIER*;
IDENTIFIER: ('a'..'z' | 'A' .. 'Z' | '0'..'9')+;


WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN; };

Upvotes: 0
