Reputation: 2756
I have been using antlr4 to parse a German document and so far I have done the following to parse the text that includes German characters:
LETTERS:
[a-zA-Z_\u00DC\u00FC\u00D6\u00F6\u00C4\u00E4\u00DF]; // hex unicodes for ÜüÖöÄäß
what is the best way to describe lingual characters of all languages in Unicode in a way that antlr understands, without specifying each language/character individually? say, the french, Arabic, or Chinese, Japanese characters?
Thank you
Upvotes: 4
Views: 669
Reputation: 6001
The best way is to use character ranges corresponding to the desired Unicode classes. Even then, the result can be a bit clumsy. See this worked example.
The raw data available in the Unicode standard's Appendix tables can be stripped and munged into a usable format with just a bit too much effort. ;)
Upvotes: 2