Hakanai

Reputation: 12688

What do I do in ANTLR if I want to parse something which is extremely configurable?

I'm writing a grammar to recognise simple mathematical expressions. I have it working for English.

Now I want to expand the grammar to support i18n. Therefore, the digits, radix separator and so forth depend on the user's locale.

What is the best way to do this in ANTLR?

What I'm currently considering is something like this:

lexer grammar ExpressionLexer;

options {
    superClass = AbstractLexer;
}

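// _input.LA(-1) is the code point just matched by '.'; isDigit expects an Int, not the String from getText()
DIGIT: . {isDigit(_input.LA(-1))}?;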
// ... and so on for other tokens ...

abstract class AbstractLexer(input: CharStream, private val symbols: Symbols) : Lexer(input) {
    fun isDigit(codePoint: Int): Boolean = symbols.isDigit(codePoint)
    // ... and so on for other tokens ...
}

Alternative approaches I am considering:

(b) I gather every possible digit and every possible separator in every possible locale, and jam all of those into the one grammar, and then check isDigit after that.

(c) I make a different lexer for every single numbering system and somehow align them all to emit the same token types in the same order, so they can be swapped in and out (sounds like it might be the most pure and correct solution? but not the most enjoyable.)

(And on a side tangent, how do people in European countries which use comma for the decimal separator deal with writing function calls with more than one parameter?)

Upvotes: 1

Views: 99

Answers (2)

Bart Kiers

Reputation: 170257

Note that from ANTLR v4.7 onward, lexer grammars have much better Unicode support: https://github.com/antlr/antlr4/blob/master/doc/unicode.md

So you could define a lexer rule like this:

DIGIT
 : [\p{Digit}]
 ;

which will match both ٣ and 3.
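
Once such a token is matched, the rest of the program still has to turn the digit into a value. Here is a minimal Kotlin sketch of that step, assuming the JVM: the helper name digitValue is hypothetical, while Character.isDigit and Character.digit are the standard JDK methods.

```kotlin
// Hypothetical helper: maps any Unicode decimal digit to its numeric value.
// Returns null when the code point is not a decimal digit.
fun digitValue(codePoint: Int): Int? =
    if (Character.isDigit(codePoint)) Character.digit(codePoint, 10) else null

fun main() {
    println(digitValue('3'.code))   // ASCII digit three
    println(digitValue(0x0663))     // ARABIC-INDIC DIGIT THREE (٣)
    println(digitValue('a'.code))   // not a digit
}
```

This accepts digits from every script at once, so it does not by itself enforce the user's locale.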

Upvotes: 1

Mike Lischke

Reputation: 53492

I recommend doing that in two steps:

  1. Parse the main language structure (e.g. (digits+ separator)+), regardless of what a digit or a separator is.

  2. Do a semantic check against the user's locale to see whether the digits that were given actually match what's allowed. Same for the separator.

This way you don't need to resort to all kinds of hacks, add platform code and so on.
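
Step 2 could look roughly like the following Kotlin sketch. It is only an illustration under stated assumptions: NumberSymbols and matchesLocale are hypothetical names (not part of ANTLR), the locale's ten digits are assumed to be contiguous code points starting at the zero digit, and the symbols are read from the JDK's java.text.DecimalFormatSymbols.

```kotlin
import java.text.DecimalFormatSymbols
import java.util.Locale

// Hypothetical holder for the locale-specific symbols (not an ANTLR class).
data class NumberSymbols(val zeroDigit: Int, val decimalSeparator: Int) {
    // Assumption: the ten digits are contiguous code points starting at zeroDigit.
    fun isDigit(codePoint: Int): Boolean = codePoint in zeroDigit..(zeroDigit + 9)

    fun isSeparator(codePoint: Int): Boolean = codePoint == decimalSeparator

    companion object {
        // Reads the zero digit and decimal separator the JDK reports for the locale.
        fun forLocale(locale: Locale): NumberSymbols {
            val dfs = DecimalFormatSymbols.getInstance(locale)
            return NumberSymbols(dfs.zeroDigit.code, dfs.decimalSeparator.code)
        }
    }
}

// Semantic check run on the text of a number token the lexer has already accepted.
// It only checks which characters were used; the shape of the number is the grammar's job.
fun matchesLocale(tokenText: String, symbols: NumberSymbols): Boolean =
    tokenText.codePoints().allMatch { symbols.isDigit(it) || symbols.isSeparator(it) }

fun main() {
    val german = NumberSymbols.forLocale(Locale.GERMANY)
    println(matchesLocale("3,14", german))
    println(matchesLocale("3.14", german))
}
```

A failed check can then be reported as an ordinary semantic error ("this digit is not valid in your locale"), which is usually a friendlier message than a lexer error.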

For your side question: programming usually uses the English language, including the number format. In strings you can use any format you want, but that doesn't affect the surrounding code.

Upvotes: 1
