Curbing ANTLR4 greediness (Building ANTLR4 Grammar for existing DSL)

Question

I already have a DSL and would like to build an ANTLR4 grammar for it.

Here is an example of that DSL:

rule isC {
    true  when O_M in [5, 6, 17, 34]
    false in other cases
}

rule isContract {
    true  when O_C in ['XX','XY','YY']
    false in other cases
}

rule isFixed {
    true  when F3 ==~ '.*/.*/.*-F.*/.*'
    false in other cases
}

rule temp[1].future {
    false when O_OF in ['C','P']
    true  in other cases
}

rule temp[0].scale {
    10 when O_M == 5 && O_C in ['YX']
    1  in other cases 
}

Right now the DSL is parsed with regular expressions, which have become a total mess, so a grammar is needed.

It works as follows: the left part (before when) and the right part are extracted, and both are evaluated by Groovy.

I would still like to have it evaluated by Groovy, but organize the parsing process by using grammar. So, in essence, what I need is to extract these left and right parts using some kind of wildcards.

Unfortunately, I cannot figure out how to do that. Here is what I have so far:

grammar RuleDSL;

rules: basic_rule+ EOF;

basic_rule: 'rule' rule_name '{' condition_expr+ '}';

name: CHAR+;
list_index: '[' DIGIT+ ']';
name_expr: name list_index*;
rule_name: name_expr ('.' name_expr)*;

condition_expr: when_condition_expr | otherwise_condition_expr;

condition: .*?;
result: .*?;
when_condition_expr: result WHEN condition;

otherwise_condition_expr: result IN_OTHER_CASES;

WHEN: 'when';
IN_OTHER_CASES: 'in other cases';


DIGIT: '0'..'9';
CHAR: 'a'..'z' | 'A'..'Z';
SYMBOL: '?' | '!' | '&' | '.' | ',' | '(' | ')' | '[' | ']' | '\\' | '/' | '%'
      | '*' | '-' | '+' | '=' | '<' | '>' | '_' | '|' | '"' | '\'' | '~';


// Whitespace and comments

WS: [ \t\r\n\u000C]+ -> skip;
COMMENT: '/*' .*? '*/' -> skip;

This grammar is too greedy, and only one rule is processed. That is, if I listen to the parse with

@Override
public void enterBasic_rule(Basic_ruleContext ctx) {
    System.out.println("ENTERING RULE");
}

@Override
public void exitBasic_rule(Basic_ruleContext ctx) {
    System.out.println(ctx.getText());
    System.out.println("LEAVING RULE");
}

I have the following as output

ENTERING RULE
-- tons of text
LEAVING RULE

How can I make it less greedy, so that parsing the given input yields 5 rules? I suppose the greediness comes from condition and result.

UPDATE: It turned out that skipping whitespace wasn't the best idea, so after a while I ended up with the following: link to gist

Thanks 280Z28 for the hint!

Sam Harwell · Accepted Answer

Instead of using .*? in your parser rules, try using ~'}'* to ensure that those rules won't try to read past the end of the rule.

Also, you skip whitespace in your lexer but use CHAR+ and DIGIT+ in your parser rules. This means the following are equivalent:

rule temp[1].future
rule t e m p [ 1 ] . f u t u r e

Beyond that, you made in other cases a single token instead of 3, so the following are not equivalent:

true  in other cases
true  in  other cases

You should probably start by making the following lexer rules, and then making the CHAR and DIGIT rules fragment rules:

ID : CHAR+;
INT : DIGIT+;
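
Putting these suggestions together, a revised grammar might look roughly like the sketch below. It is an untested illustration, not the asker's final gist. It assumes every clause ends at a line break, so newlines are kept as an NL token instead of being skipped. The token names NL, STRING, ANY, LBRACE, RBRACE, LBRACK, RBRACK and DOT are made up for this example. Whitespace goes to the hidden channel rather than being skipped, so the original text stays available in the token stream.

```antlr
grammar RuleDSL;

rules: NL* basic_rule+ EOF;

basic_rule: RULE rule_name LBRACE NL+ condition_expr+ RBRACE NL*;

rule_name: name_expr (DOT name_expr)*;
name_expr: ID list_index*;
list_index: LBRACK INT RBRACK;

// Each clause sits on its own line and is terminated by at least one newline.
condition_expr: (when_condition_expr | otherwise_condition_expr) NL+;
when_condition_expr: result WHEN condition;
otherwise_condition_expr: result IN OTHER CASES;

// "Wildcards" that cannot run past the end of the line or the end of the rule.
result: ~(WHEN | IN | NL | RBRACE)+;
condition: ~(NL | RBRACE)+;

// Keywords come before ID so they win when both match the same text.
RULE: 'rule';
WHEN: 'when';
IN: 'in';
OTHER: 'other';
CASES: 'cases';

LBRACE: '{';
RBRACE: '}';
LBRACK: '[';
RBRACK: ']';
DOT: '.';

ID: CHAR (CHAR | DIGIT | '_')*;
INT: DIGIT+;
STRING: '\'' ~['\r\n]* '\'';

NL: '\r'? '\n';
WS: [ \t\u000C]+ -> channel(HIDDEN);
COMMENT: '/*' .*? '*/' -> skip;

// Catch-all for operator characters such as = ~ & , that Groovy will interpret later.
ANY: .;

fragment DIGIT: [0-9];
fragment CHAR: [a-zA-Z];
```

With whitespace on the hidden channel, ctx.getText() still concatenates only the tokens the parser saw. To hand the original spelling of a condition or result to Groovy, it is probably better to take the text from the token stream, for example with TokenStream.getText(ctx). One side effect of this sketch is that rule, when, in, other and cases become reserved words and can no longer appear as plain names.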
