How to preserve whitespace in ANTLR4 when reading a string?

Question

I am writing a grammar that will be used to generate a lexer and parser in C#. It will take the input below and output SQL.

(path.path.path="how to do something")

At the moment I am ignoring whitespace by using:

WS : [ ]+ -> skip; // Skips whitespace

The problem is, when I'm reading the content within the quotes, I need to preserve the whitespace as it will be used in a search. How do I go about doing this? Thank you.

EDIT 1

Here is my current attempt at parsing strings:

TEXT            :   [a-zA-Z_]+;

There isn't much as I have just run into this problem, but I can't find a solution that I understand how to implement.

I have also added

@lexer::members{
    //Lexer members

    //Used to preserve whitespace when reading in statements
    boolean ignore=true;
}

I have seen something similar to this dotted about, where the value of ignore determines whether whitespace is skipped. I have also changed my whitespace rule to:

WS              :   [ \t\r\n]+ {if(ignore) skip();};

But I'm unsure how I would set ignore to false before scanning over the statement and how I would change it back to true after I have finished.

EDIT 2

I have copied over the entirety of my grammar file:

// Grammar for Search Criteria
grammar SearchGen;

@members{
    //Lexer members

    //Used to preserve whitespace when reading in statements
    boolean ignore=true;
}

r               :   block_clause*
                |   block
                |   statement;

// block = ( statement CLAUSE statement )
block           :   OPEN_BRACKET start_segment+ end_segment CLOSE_BRACKET;

block_clause    :   block clause;

start_segment   :   statement clause;

end_segment     :   statement;

statement       :   OPEN_BRACKET path search_term CLOSE_BRACKET; //Change TEXT to allows for blah.blah="hiv"

path            :   TEXT '.' TEXT '.' TEXT;

clause          :   NOT | OR | AND | WITHIN;

search_term     :   OPERATOR SEARCH_TYPE;

OPEN_BRACKET    :   '(';

CLOSE_BRACKET   :   ')';

UNDERSCORE      :   '_';

SEARCH_TYPE     :   '"' (~["\\] | '\\' .)* '"';

OPERATOR        :   EQUALS | GREATER_THAN | LESS_THAN | AMP; //Maybe put the amp and quotes in TEXT/

GREATER_THAN    :   '>' | '&gt;';

LESS_THAN       :   '<' | '&lt;';

QUOTE           :   '"' | '&quot;';

EQUALS          :   '=';

AMP             :   '&' | '&amp;';

NOT             :   'NOT' | 'not';

OR              :   'OR' | 'or';

AND             :   'AND' | 'and';

WITHIN          :   'WITHIN' | 'within';



//Possible problem : If a keyword that the user is looking for matches one of the above tokens

TEXT            :   [a-zA-Z_]+; // Include Underscore

DIGIT           :   [0-9]+;

// yyyy-mm-dd

DATE            : YEAR'-'MONTH'-'DAY;

YEAR            :   [1-2][(0-9)][(0-9)][(0-9)];

MONTH           :   [0][1-9] | [1-9] | [1][(0-2)];

DAY             :   [0][1-9] | [1-2][0-9] | [3][0-1];

WS              :   [ \t\r\n]+ ->skip;               // Skips whitespace

sepp2k · Accepted Answer

So it looks like you handle strings in your first alternative of SEARCH_TYPE, namely:

SEARCH_TYPE     :   '"'TEXT'"'

Now the problem with that rule isn't that it ignores spaces - it's that spaces aren't allowed at all, because TEXT does not match spaces. So if you entered something like " hello ", you wouldn't get a string without the spaces, you'd get a syntax error because that input does not match the '"'TEXT'"' pattern. Only "hello" would be a valid string according to that rule. Characters other than letters or underscores are also not allowed, which differs from how strings usually work.

Presumably you want to allow anything inside double-quoted strings except for double quotes (and in most programming languages there's also some way to escape double quotes). So we can just use an inverted character class that matches anything except double quotes:

SEARCH_TYPE: '"' ~'"'* '"';

Now to allow escaping, we might also allow a backslash followed by any character (including a double quote):

SEARCH_TYPE: '"' (~["\\] | '\\' .)* '"';

Note that this also allows empty strings, which your original rule did not.
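To see the difference between the two rules without running ANTLR, here is a minimal sketch that uses Python regular expressions as stand-ins for the lexer rules. The regexes and the helper name matches are assumptions made for illustration; they approximate the rules but are not ANTLR itself.

```python
import re

# Regex stand-in for the lexer rule  '"' (~["\\] | '\\' .)* '"'
# re.DOTALL lets '.' match a newline too, as '.' does in an ANTLR lexer rule.
SEARCH_TYPE = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)

# Regex stand-in for the original  '"' TEXT '"'  rule, with TEXT : [a-zA-Z_]+
OLD_RULE = re.compile(r'"[a-zA-Z_]+"')


def matches(rule, text):
    """Return True if the whole text is a single match of the rule."""
    return rule.fullmatch(text) is not None


samples = ['"hello"', '" hello "', '"how to do something"', '""', r'"say \"hi\""']
for sample in samples:
    # Columns: input, accepted by the old rule, accepted by the new rule
    print(sample, matches(OLD_RULE, sample), matches(SEARCH_TYPE, sample))
```

Only "hello" is accepted by the stand-in for the old rule, while the stand-in for the new rule accepts every sample, including the empty string and the one with escaped quotes.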

So now our strings can actually contain spaces without producing a syntax error. So how do we prevent spaces from being ignored? We don't, because we don't need to. WS : [ ]+ -> skip; simply means that if the lexer were to produce a WS token, it skips to the next token instead. This does not affect what happens inside other lexer rules. In other words: the WS rule skips whitespace between tokens, not inside tokens. So whitespace being ignored is simply not a problem at all.
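The "between tokens, not inside tokens" behaviour can be imitated with a small hand-rolled tokenizer. This is only a sketch under stated assumptions: the token table, the DOT name and the tokenize function are hypothetical, and Python's ordered alternation is not the same as ANTLR's longest-match strategy, although the two agree for this small rule set.

```python
import re

# Hypothetical token table mirroring a few rules from the grammar.
# DOT stands in for the implicit '.' literal used in the path rule.
TOKEN_SPEC = [
    ("SEARCH_TYPE", r'"(?:[^"\\]|\\.)*"'),
    ("OPEN_BRACKET", r"\("),
    ("CLOSE_BRACKET", r"\)"),
    ("EQUALS", r"="),
    ("DOT", r"\."),
    ("TEXT", r"[a-zA-Z_]+"),
    ("WS", r"[ \t\r\n]+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(source):
    """Split source into (rule name, text) pairs, dropping WS tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        # Imitates '-> skip': a WS match is consumed but never emitted.
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


tokens = tokenize('( path.path.path = "how to  do something" )')
for kind, text in tokens:
    print(kind, repr(text))
```

The spaces around the brackets and the equals sign disappear from the token list, while the SEARCH_TYPE token keeps every space between its quotes, because those spaces were consumed as part of that token and never reached the WS pattern.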

PS: Your grammar also contains a QUOTE token that you never use, which looks like a mistake. Further, the YEAR, MONTH and DAY rules should probably be declared as fragments, as they can never be matched on their own anyway.
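One way to see why YEAR, MONTH and DAY never show up as tokens: an ANTLR 4 lexer picks the rule that matches the longest input, and on a tie the rule defined first wins. DIGIT is defined before them and matches at least as much of any run of digits. The sketch below imitates that selection with Python regexes; the patterns are simplified stand-ins (the stray parentheses in the original character classes are dropped, and the MONTH alternatives are reordered so that Python's ordered alternation yields the longest match), and next_token is a hypothetical helper.

```python
import re

# Regex stand-ins for the lexer rules, listed in the same order as the grammar.
RULES = [
    ("DIGIT", re.compile(r"[0-9]+")),
    ("DATE", re.compile(
        r"[12][0-9]{3}-(?:0[1-9]|1[0-2]|[1-9])-(?:0[1-9]|[12][0-9]|3[01])")),
    ("YEAR", re.compile(r"[12][0-9]{3}")),
    ("MONTH", re.compile(r"0[1-9]|1[0-2]|[1-9]")),
    ("DAY", re.compile(r"0[1-9]|[12][0-9]|3[01]")),
]


def next_token(text, pos=0):
    """Pick the longest match; on a tie the earlier rule wins."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly greater, so a later rule cannot displace an equal-length match.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        raise SyntaxError(f"no rule matches at position {pos}")
    return best[0], text[pos:best[1]]


for sample in ["2020", "07", "31", "2020-01-15"]:
    print(sample, next_token(sample))
```

In this imitation, a bare year, month or day always comes out as DIGIT, and only the full yyyy-mm-dd form produces a DATE. Declaring the three helper rules with the fragment keyword makes that intent explicit and stops them from being token types at all.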
