Julian Solarte

Reputation: 585

How to tokenize a word in multiple lines in ANTLR4

I want to tokenize the phrase "SINGULAR EXECUTIVE OF MINIMUM QUANTIA" when it is written across multiple lines. It is pretty simple if the full phrase is on one line

foo bar foo bar foo bar SINGULAR EXECUTIVE OF MINIMUM QUANTIA foo bar foo bar foo bar foo bar
foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo barfoo bar foo bar foo bar

but I cannot tokenize it when the phrase is split across two lines

foo bar foo bar foo bar SINGULAR EXECUTIVE OF 
MINIMUM QUANTIA foo bar foo bar foo bar foo bar
foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar 

This is my lexer

SPECIALWORD: S I N G U L A R ' ' E X E C U T I V E ' ' O F ' ' M I N I M U M ' ' Q U A N T I A;
fragment A:('a'|'A'|'á'|'Á');
......
......
fragment Z:('z'|'Z');
WORDUPPER: UCASE_LETTER UCASE_LETTER+;
WORDLOWER: LCASE_LETTER LCASE_LETTER+;
WORDCAPITALIZE: UCASE_LETTER LCASE_LETTER+;
LCASE_LETTER: 'a'..'z' | 'ñ' | 'á' | 'é' | 'í' | 'ó' | 'ú';
UCASE_LETTER: 'A'..'Z' | 'Ñ' | 'Á' | 'É' | 'Í' | 'Ó' | 'Ú';
INT: DIGIT+;
DIGIT: [0-9];  
WS : [ \t\r\n]+ -> skip;
ERROR: . ;

I have tried putting a line break into the lexer rule

SPECIALWORD: S I N G U L A R ' ' E X E C U T I V E ' ' O F [\n] M I N I M U M ' ' Q U A N T I A;

but it does not work, I guess because the lexer tokenizes line by line.

Upvotes: 1

Views: 162

Answers (2)

Mike Lischke

Reputation: 53337

So what you actually want is for a sequence of the five words to be recognized as one unit, while allowing an arbitrary amount of whitespace between them. This is how ANTLR4-based parsers work by default. Your attempt to put all of this into a single lexer token is what makes things complicated.

Instead define your (key) words as:

SINGULAR_SYMBOL: S I N G U L A R;
EXECUTIVE_SYMBOL: E X E C U T I V E;
OF_SYMBOL: O F;
MINIMUM_SYMBOL: M I N I M U M;
QUANTIA_SYMBOL: Q U A N T I A;

and define a parser rule to parse these as a special sentence:

singularExec: SINGULAR_SYMBOL EXECUTIVE_SYMBOL OF_SYMBOL MINIMUM_SYMBOL QUANTIA_SYMBOL;

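Together with your WS rule, this will match any combination of whitespace between the individual symbols. Make sure the keyword rules appear before WORDUPPER, WORDLOWER and WORDCAPITALIZE in the grammar: when two lexer rules match input of the same length, ANTLR picks the one defined first.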
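
Put together, the approach could look like the minimal combined grammar below. This is only a sketch under a few assumptions: the grammar name SpecialWord, the entry rule text, and the simplified WORD rule are hypothetical stand-ins (the question's own WORDUPPER, WORDLOWER and WORDCAPITALIZE rules would take WORD's place), and the letter fragments are reduced to plain upper/lower case pairs without the accented variants.

```antlr
// SpecialWord.g4 (hypothetical file and grammar name)
grammar SpecialWord;

// Hypothetical entry rule: a stream of ordinary words and special phrases.
// In this sketch a keyword that appears outside the full phrase
// (for example a lone "of") is reported as a syntax error.
text         : (singularExec | WORD)* EOF ;

// The phrase is a parser rule, so skipped whitespace (including
// line breaks) may appear between any two of its words.
singularExec : SINGULAR_SYMBOL EXECUTIVE_SYMBOL OF_SYMBOL MINIMUM_SYMBOL QUANTIA_SYMBOL ;

// Keyword tokens are listed before WORD so that they win
// when both rules match the same text.
SINGULAR_SYMBOL  : S I N G U L A R ;
EXECUTIVE_SYMBOL : E X E C U T I V E ;
OF_SYMBOL        : O F ;
MINIMUM_SYMBOL   : M I N I M U M ;
QUANTIA_SYMBOL   : Q U A N T I A ;

// Simplified stand-in for the question's word rules (assumption).
WORD : [a-zA-Z]+ ;

// Whitespace, including line breaks, never reaches the parser.
WS : [ \t\r\n]+ -> skip ;

// Case-insensitive letter fragments used by the keywords above.
fragment A : [aA] ;
fragment C : [cC] ;
fragment E : [eE] ;
fragment F : [fF] ;
fragment G : [gG] ;
fragment I : [iI] ;
fragment L : [lL] ;
fragment M : [mM] ;
fragment N : [nN] ;
fragment O : [oO] ;
fragment Q : [qQ] ;
fragment R : [rR] ;
fragment S : [sS] ;
fragment T : [tT] ;
fragment U : [uU] ;
fragment V : [vV] ;
fragment X : [xX] ;
```

With this layout the lexer emits five keyword tokens for the phrase whether it sits on one line or is split across two, because the line break is consumed by WS before the parser sees anything.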

Upvotes: 1

sepp2k

Reputation: 370132

Your revised rule matches if there is exactly one \n and no other character between "OF" and "MINIMUM". However, your input contains a space before the line break. Thus the rule does not match.

If you remove the space from the input or you adjust your rule to allow spaces before the line break, it will match.

You'll probably want to use either [ \n]+ to allow an arbitrary number of spaces and/or line breaks (you might want to throw in \t and \r as well for good measure) or ' '* '\n' ' '* if you still want to restrict it to a single line break, but allow any number of spaces around it.
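
As a rough sketch, the two options could be written like this inside the question's lexer. It relies on the letter fragments from the question (A to Z), and GAP is a hypothetical helper fragment introduced only for this illustration. Only one SPECIALWORD definition can be active at a time, so the second variant is shown commented out.

```antlr
// Variant 1: any run of spaces, tabs or line breaks between the words.
SPECIALWORD : S I N G U L A R GAP E X E C U T I V E GAP O F GAP M I N I M U M GAP Q U A N T I A ;

// GAP is a hypothetical helper name, not part of the original grammar.
fragment GAP : [ \t\r\n]+ ;

// Variant 2 (use instead of variant 1): single spaces between the other
// words, and exactly one line break after OF with optional spaces around it.
// Input with Windows line endings would additionally need an optional '\r'.
// SPECIALWORD : S I N G U L A R ' ' E X E C U T I V E ' ' O F ' '* '\n' ' '* M I N I M U M ' ' Q U A N T I A ;
```

Either variant accepts the sample input from the question, since the space that precedes the line break after "OF" is now covered by the rule.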

That said, you'll probably have an easier time if you make each word its own token.

Upvotes: 0
