Reputation: 3248
I'm trying to make a tree-sitter parser, so that IDEs (in this case, Vim) can parse and do more advanced manipulation of Ada program text, such as extract-subprogram and rename-variable. But there seem to be some problems defining the character set.
In the Ada 2012 Reference Manual, I found a list of vague category descriptions of the form 'Any character whose General Category is X', which means that, for instance, besides the underscore, all of these ( ‿ ⁀ ⁔ ︳ ︴ ﹍ ﹎ ﹏ _) are also allowed in an identifier. That seems absurd, and GNAT rejects them with 'illegal character'. The list is prefaced by this statement:
"The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified."
Does that really mean there's no way to know which characters should be accepted?
Two pages on, these examples are explicitly given as valid identifiers, and yet GNAT 2021 rejects them:
procedure Main is
   Πλάτων : constant := 12; -- Plato
   Чайковский : constant := 12; -- Tchaikovsky
   θ, φ : constant := 12; -- Angles
begin
   null;
end Main;
$ gprbuild
using project file foo.gpr
Compile
[Ada] main.adb
main.adb:2:04: error: declaration expected
main.adb:2:05: error: illegal character
main.adb:3:04: error: declaration expected
main.adb:3:05: error: illegal character
main.adb:4:05: error: illegal character
gprbuild: *** compilation phase failed
Where is the actual character set for Ada programs defined? Has GNAT 2021 got it wrong?
An example program using Unicode characters in identifiers is below for your experimentation. Note that the use of wide characters in the literal string is outside the scope of the question.
main.adb:
with Ada.Wide_Text_IO; use Ada.Wide_Text_IO;
procedure Main is
   δεδομένα_πράμα : constant Wide_String := "Ο Πλάτων θα ενέκρινε";
begin
   Put_Line (Δεδομένα_πράμα);
end Main;
foo.gpr:
project foo is
   for Source_Dirs use (".");
   for Main use ("main.adb");
   package Compiler is
      for Default_Switches ("ada") use ("-gnatW8", "-gnatiw");
   end Compiler;
end foo;
To build & run:
gprbuild
./main
Upvotes: 2
Views: 374
Reputation: 51
All Ada versions since Ada 2005 have required that implementations support UTF-8 source code; however, for compatibility with Ada 83 and Ada 95, they don't require it to be the default encoding. GNAT's default source encoding is Latin-1, although it helpfully switches to UTF-8 if a byte-order mark is found. To specify the file encoding explicitly, you can pass the -gnatW8 flag, or one of a number of other options.
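For an external tool such as a parser front end, the decoding behaviour described above can be approximated as follows. This is only a sketch in Python of "Latin-1 unless a UTF-8 byte-order mark is present"; the function name `decode_ada_source` is made up for illustration, and it is not GNAT's actual implementation.

```python
import codecs


def decode_ada_source(data: bytes) -> str:
    """Decode source bytes the way the answer describes GNAT's default.

    Assumption: UTF-8 when the file starts with a UTF-8 byte-order mark,
    Latin-1 otherwise. Explicit switches such as -gnatW8 are not modelled.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Drop the BOM itself, then decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # Latin-1 maps every byte to one character, so this never fails,
    # but UTF-8 multi-byte sequences come out as several characters.
    return data.decode("latin-1")
```

Without the BOM, the two UTF-8 bytes of θ decode to two Latin-1 characters, which is consistent with the "illegal character" errors reported at consecutive columns in the question.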
However, while that allows UTF-8 in source files, identifiers are still limited to Latin-1 in GNAT; you must also pass the -gnatiw flag to allow wide characters in identifiers. It seems that GNAT does not default to it because you can craft very bizarre identifiers (as you noted), and also because identifiers would no longer be properly case-insensitive: GNAT does minimal case folding on any wide character set, other than characters present in other encodings it supports.
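To see what case-insensitive comparison of non-Latin identifiers involves for a tool like a rename-variable refactoring, here is a rough Python sketch. It uses `str.casefold`, which performs full Unicode case folding, so it only approximates whatever folding the language rules or GNAT actually apply; the helper name `same_identifier` and the NFC normalization step are assumptions for illustration.

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Return True if two identifiers match when letter case is ignored.

    Assumption: NFC normalization followed by Python's full case folding
    is close enough for illustration; it is not taken from the ARM.
    """
    def key(name: str) -> str:
        return unicodedata.normalize("NFC", name).casefold()

    return key(a) == key(b)
```

With this helper, the declaration `δεδομένα_πράμα` and the use `Δεδομένα_πράμα` from the question's example compare as the same name.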
ARM § 2.3 specifies the requirements for an identifier:
identifier ::= identifier_start {identifier_start | identifier_extend}
where identifier_start can be summarized as anything in Unicode general category L (plus letter numbers, category Nl), and the remaining characters can also be decimal digits, punctuation_connectors, and nonspacing or spacing combining marks, with the additional restriction that “An identifier shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category.”
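Since the original goal was a parser, the rule above can be turned into a small checker. The sketch below is in Python because its standard `unicodedata` module exposes general categories directly. The category sets are my reading of ARM 2012 § 2.3 (start: Lu, Ll, Lt, Lm, Lo, Nl; extend: Mn, Mc, Nd, Pc), the function name is hypothetical, and reserved words and any normalization requirements are deliberately ignored.

```python
import unicodedata

# Assumed mapping of the ARM category names to Unicode general categories.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
IDENTIFIER_EXTEND = {"Mn", "Mc", "Nd", "Pc"}


def is_ada_identifier(text: str) -> bool:
    """Check text against the identifier syntax summarized above."""
    if not text:
        return False
    categories = [unicodedata.category(ch) for ch in text]
    # The first character must be an identifier_start.
    if categories[0] not in IDENTIFIER_START:
        return False
    # An identifier may not end with a punctuation_connector.
    if categories[-1] == "Pc":
        return False
    previous = None
    for category in categories:
        if category not in IDENTIFIER_START and category not in IDENTIFIER_EXTEND:
            return False
        # No two consecutive punctuation_connector characters.
        if category == "Pc" and previous == "Pc":
            return False
        previous = category
    return True
```

Under these assumptions `Πλάτων`, `Чайковский` and `θ` are accepted, `a__b` and `a_` are rejected, and so is `_a`, because an underscore is not an identifier_start. A name such as `a‿b` is accepted, which matches the observation in the question that the other connector characters are syntactically allowed.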
Going beyond your question, note that despite all these flags, strings are still encoded as Latin-1 (confusingly, string literals are UTF-8, just not the underlying string :/). You'll need to use Ada.Strings.UTF_Encoding, Wide_Wide_Strings, and/or a library such as VSS for Unicode string handling.
Upvotes: 5