asm

Reputation: 8898

What is the best way to handle overlapping lexer patterns that are sensitive to context?

I'm attempting to write an ANTLR grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open ended, resulting in overlapping lexer rules (in the sense that multiple token rules can match the same input).

For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:

workspace "Name" "Description" {
    properties {
        xyz "a string property"
        nonstring nodoublequotes
    }
}

The issue I'm running into is that the rules for the <name> and <value> have to be very broad, basically anything except whitespace. In addition, property values that contain spaces are wrapped in double quotes, so they match my STRING token.

My current solution is the grammar below, using property_value : BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This lets me easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.

grammar c4mce;

workspace : 'workspace' (STRING (STRING)?)?  '{' NL workspace_body '}';

workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';

property_element: BLOB property_value;
property_value : BLOB | STRING;

BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;

This tokenizes to

[@0,0:8='workspace',<'workspace'>,1:0]
[@1,10:15='"Name"',<STRING>,1:10]
[@2,17:29='"Description"',<STRING>,1:17]
[@3,31:31='{',<'{'>,1:31]
[@4,32:32='\n',<NL>,1:32]
[@5,37:46='properties',<'properties'>,2:4]
[@6,48:48='{',<'{'>,2:15]
[@7,49:49='\n',<NL>,2:16]
[@8,58:60='xyz',<BLOB>,3:8]
[@9,62:80='"a string property"',<STRING>,3:12]
[@10,81:81='\n',<NL>,3:31]
[@11,90:98='nonstring',<BLOB>,4:8]
[@12,100:113='nodoublequotes',<BLOB>,4:18]
[@13,114:114='\n',<NL>,4:32]
[@14,119:119='}',<'}'>,5:4]
[@15,120:120='\n',<NL>,5:5]
[@16,121:121='}',<'}'>,6:0]
[@17,122:122='\n',<NL>,6:1]
[@18,123:122='<EOF>',<EOF>,7:0]

This all works, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this? As I expand the grammar I expect to end up with a lot of BLOB tokens, since defining a narrower lexer token would be pointless: BLOB would match instead.
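
For concreteness, this is roughly what I mean by context-sensitive tokens, sketched with ANTLR lexer modes (untested against the full DSL, and modes require splitting the lexer into its own grammar):

lexer grammar c4mceLexer;

K_WORKSPACE  : 'workspace';
K_PROPERTIES : 'properties' -> pushMode(PROPS);
LBRACE       : '{';
RBRACE       : '}';
STRING       : '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL           : '\r'? '\n';
WS           : [ \t]+ -> skip;
BLOB         : [\p{Alpha}]+;

mode PROPS;
// Inside the properties block, anything that is not whitespace, a quote or a
// brace becomes a VALUE token, so names and values can never collide with keywords.
PROPS_LBRACE : '{' -> type(LBRACE);
PROPS_RBRACE : '}' -> type(RBRACE), popMode;
PROPS_STRING : '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"' -> type(STRING);
PROPS_NL     : '\r'? '\n' -> type(NL);
PROPS_WS     : [ \t]+ -> skip;
VALUE        : ~[ \t\r\n"{}]+;

The parser grammar would then pull in this vocabulary with options { tokenVocab=c4mceLexer; } and use VALUE in place of BLOB inside the properties block.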

Upvotes: 0

Views: 221

Answers (1)

Mike Lischke

Reputation: 53552

This is the classic keywords-as-identifiers problem. If you want a specific character sequence that is lexed as a keyword to also be usable as a normal identifier in certain places, you have to list that keyword as a possible alternative in those places. For example:

property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
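
As the list of keywords grows, it can help to collect them in a single rule and reuse it wherever a plain identifier is allowed (a sketch; K_PROPERTIES and any further keyword token names are assumed here):

// Sketch only: extend the keyword rule with whatever keyword tokens the grammar defines.
keyword : K_WORKSPACE | K_PROPERTIES /* | ... other keyword tokens ... */ ;
name    : BLOB | keyword ;

property_element : name property_value ;
property_value   : name | STRING ;

Only the keyword rule then needs to be extended when a new keyword token is added.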

Upvotes: 1
