Reputation: 8898
I'm attempting to write an ANTLR grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open-ended, resulting in overlapping lexer rules (in the sense that multiple token rules match the same input).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
    properties {
        xyz "a string property"
        nonstring nodoublequotes
    }
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad, basically anything except whitespace. Also, property values that contain spaces and are wrapped in double quotes will match my STRING token.
My current solution is the grammar below, using property_value: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[@0,0:8='workspace',<'workspace'>,1:0]
[@1,10:15='"Name"',<STRING>,1:10]
[@2,17:29='"Description"',<STRING>,1:17]
[@3,31:31='{',<'{'>,1:31]
[@4,32:32='\n',<NL>,1:32]
[@5,37:46='properties',<'properties'>,2:4]
[@6,48:48='{',<'{'>,2:15]
[@7,49:49='\n',<NL>,2:16]
[@8,58:60='xyz',<BLOB>,3:8]
[@9,62:80='"a string property"',<STRING>,3:12]
[@10,81:81='\n',<NL>,3:31]
[@11,90:98='nonstring',<BLOB>,4:8]
[@12,100:113='nodoublequotes',<BLOB>,4:18]
[@13,114:114='\n',<NL>,4:32]
[@14,119:119='}',<'}'>,5:4]
[@15,120:120='\n',<NL>,5:5]
[@16,121:121='}',<'}'>,6:0]
[@17,122:122='\n',<NL>,6:1]
[@18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to have a lot of BLOB tokens, simply because creating a narrower token in the lexer would be pointless: BLOB would match instead.
Upvotes: 0
Views: 221
Reputation: 53552
This is the classic keywords-as-identifiers problem. If a character sequence that is lexed as a keyword should also be usable as a normal identifier in certain places, you have to list that keyword token as a possible alternative in the parser rules for those places. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
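As the number of keywords grows, it can help to collect those alternatives in a single parser rule. Below is a minimal sketch of the grammar from the question rewritten that way. The name rule and the K_PROPERTIES token are assumptions for illustration (only K_WORKSPACE is mentioned in the question), and the character-set spelling of the keywords is just one possible way to make them case insensitive:

```antlr
grammar c4mce;

workspace         : K_WORKSPACE (STRING STRING?)? '{' NL workspace_body '}';
workspace_body    : (workspace_element NL)* ;
workspace_element : K_PROPERTIES '{' NL (property_element NL)* '}';
property_element  : name property_value;
property_value    : name | STRING;

// Hypothetical helper rule: any bare word, including words the lexer
// has already classified as keywords. Add each new keyword token here.
name              : BLOB | K_WORKSPACE | K_PROPERTIES;

// Keyword rules are listed before BLOB. When two lexer rules match the
// same length of input, ANTLR picks the rule that is defined first.
K_WORKSPACE  : [Ww][Oo][Rr][Kk][Ss][Pp][Aa][Cc][Ee];
K_PROPERTIES : [Pp][Rr][Oo][Pp][Ee][Rr][Tt][Ii][Ee][Ss];

BLOB   : [\p{Alpha}]+;
STRING : '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL     : '\r'? '\n';
WS     : [ \t]+ -> skip;
```

With this layout, a property line such as workspace properties should lex as K_WORKSPACE K_PROPERTIES and still parse as a name followed by a value, because both positions accept the name rule. The cost is that every keyword added to the lexer also has to be added to name.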
Upvotes: 1