How to parse the tokens inside a long lexer rule that cannot be converted into a parser rule?

Question

I am trying to parse this with ANTLR4:

> A Request [AR]
Commments might have many lines here
Line 2
 
- A Response [A]
- The other response [B]
Response can also have lines here.

> Request [A]
- Responce

The following code parses it very well:

grammar Response;

prog: (request | response)+ EOF;

request: REQUEST TEXT*;
response: RESPONSE TEXT*;
 
REQUEST: '>' TEXT '[' ID ']';
RESPONSE: '-' TEXT ('[' ID ']')?;
 
ID: [a-zA-Z] [a-zA-Z0-9._]*;
TEXT: ~[\r\n]+;
 
EMPTY: [ \t\r\n]+ -> skip;

This is a good result. However, I would like to get the ID and TEXT parts separately. Because they are matched inside the larger REQUEST and RESPONSE lexer rules, they are not available as separate tokens.

As I understand it, the usual fix in this case is to turn the lexer rules REQUEST and RESPONSE into parser rules such as request_rule and response_rule.

But this does not work here, because the TEXT lexer rule then matches each and every line in full. That is what happens, for example, if I replace REQUEST and RESPONSE with ruleREQUEST and ruleRESPONSE.

I am trying to figure out how to proceed. It seems that the only way is to make the grammar far more complicated by using a number of pushMode and popMode commands, as described here:

https://github.com/antlr/antlr4/issues/2229 (incorrect lexer rule precedence with "not" rules)

Is there a simple way, based on the original ANTLR4 grammar, to get the TEXT and ID values in C# with Antlr4.Runtime.Standard? Other than that, the grammar works perfectly.

kaby76 · Accepted Answer

TEXT is greedy: it consumes everything up to the end of the line, and because the lexer prefers the longest match, it wins over all other lexer rules. You will need to make it non-greedy by adding a '?' operator after the '+'.

Once you do that, however, the parser rules will need to be changed to allow different tokens.
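The two points above can be illustrated outside ANTLR. The following is a minimal Python sketch, not ANTLR itself: it assumes only the lexer's selection rule (the longest match wins, ties go to the rule defined first) and approximates the non-greedy '+?' as a TEXT rule that matches exactly one character. The rule tables and the tokenize helper are hypothetical names made up for this illustration.

```python
import re

# Hypothetical rule tables that mimic the grammar's lexer rules.
# They are listed in definition order, because ties go to the earliest rule.
COMMON = [
    ("ID", r"[a-zA-Z][a-zA-Z0-9._]*"),
    ("GT", r">"),
    ("LP", r"\["),
    ("RP", r"\]"),
    ("DS", r"-"),
]
GREEDY = COMMON + [("TEXT", r"[^\r\n]+"), ("EMPTY", r"[ \t\r\n]+")]
# Assumption: the non-greedy TEXT is modelled as "exactly one character".
NON_GREEDY = COMMON + [("TEXT", r"[^\r\n]"), ("EMPTY", r"[ \t\r\n]+")]


def tokenize(source, rules, skipped=("EMPTY",)):
    """Split source into (rule, text) pairs using longest-match selection."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(source):
        best_name, best_end = None, pos
        for name, pattern in compiled:
            match = pattern.match(source, pos)
            # Accept strictly longer matches only, so the earliest rule keeps a tie.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skipped:
            tokens.append((best_name, source[pos:best_end]))
        pos = best_end
    return tokens


line = "> A Request [AR]\n"
print(tokenize(line, GREEDY))      # one TEXT token covering the whole line
print(tokenize(line, NON_GREEDY))  # GT, ID, LP, RP and one-character TEXT tokens
```

In this model the greedy table turns the whole line into a single TEXT token, so the parser never sees '>' or ID. With the one-character TEXT, the line is split into smaller tokens, which is why the parser rules then have to accept both ID and TEXT wherever free text may appear.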

Here is a grammar that may work instead. It works for your input, but you may need to make further changes.

grammar Response;

prog: (request | response)+ EOF;
request: request_rule text*;
response: response_rule text*;
request_rule: '>' text '[' ID ']';
response_rule: '-' text ('[' ID ']')?;
text: (ID | TEXT)+;
ID: [a-zA-Z] [a-zA-Z0-9._]*;
GT: '>';
LP: '[';
RP: ']';
DS: '-';
TEXT: ~[\r\n]+?;
EMPTY: [ \t\r\n]+ -> skip;
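If the original grammar is kept unchanged, with REQUEST and RESPONSE as single lexer tokens, another option is to split the token text in application code after parsing. This is not part of the answer above. It is a rough Python sketch of the idea using regular expressions; split_request, split_response and both patterns are hypothetical names, and they assume the token text has the shapes shown in the question.

```python
import re

# Same shape as the grammar's ID rule.
ID = r"[a-zA-Z][a-zA-Z0-9._]*"

# Mirrors REQUEST: '>' TEXT '[' ID ']'
REQUEST_RE = re.compile(rf">\s*(?P<text>.*?)\s*\[(?P<id>{ID})\]\s*$")
# Mirrors RESPONSE: '-' TEXT ('[' ID ']')?
RESPONSE_RE = re.compile(rf"-\s*(?P<text>.*?)\s*(?:\[(?P<id>{ID})\])?\s*$")


def split_request(token_text):
    """Return (text, id) for the text of a REQUEST token."""
    match = REQUEST_RE.match(token_text)
    if match is None:
        raise ValueError(f"not a request: {token_text!r}")
    return match.group("text"), match.group("id")


def split_response(token_text):
    """Return (text, id) for the text of a RESPONSE token; id may be None."""
    match = RESPONSE_RE.match(token_text)
    if match is None:
        raise ValueError(f"not a response: {token_text!r}")
    return match.group("text"), match.group("id")


print(split_request("> A Request [AR]"))
print(split_response("- The other response [B]"))
print(split_response("- Responce"))
```

The trade-off is that the token's structure is then described twice, once in the grammar and once in the patterns, so the two have to be kept in sync by hand.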
