Reputation: 7944
The reason I ask is that there's a "standard" for affix files that says to read the first line of a file, and it will tell you how the file is encoded:
The first line specifies the character set used for both the
wordlist and the affix file (should be all uppercase).
For example:
SET ISO8859-1
That strikes me as both unreasonable and unreliable, unless all character sets have the 7-bit ASCII range in common, which would allow you to "taste" the file up to the first newline byte(s): 0xA or 0xD.
But I have no idea if the ASCII range is common to all character sets or not.
Upvotes: 1
Views: 129
Reputation: 90995
No. EBCDIC is not ASCII-based, and it is still used in IBM mainframe software environments with extreme backwards-compatibility requirements.
More popular are UTF-16 and UTF-32, which, although ASCII-based, are not byte-compatible with ASCII because of all the extra 00 bytes.
Still, there are only a few ways to encode the Basic Latin alphabet. (What distinguishes most of the hundreds of character encodings that exist is their handling of accented and non-Latin letters.) So, the program that reads these files only needs to handle a few possible ways of encoding the word SET:
- 53 45 54 for ASCII-based encodings (Windows-1252, UTF-8, etc.)
- E2 C5 E3 for EBCDIC-based encodings (if these are considered worth supporting at all)
- 00 53 00 45 00 54 for UTF-16BE
- 53 00 45 00 54 00 for UTF-16LE
- 00 00 00 53 00 00 00 45 00 00 00 54 for UTF-32BE
- 53 00 00 00 45 00 00 00 54 00 00 00 for UTF-32LE

The decoder could simply look for them all.
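As a rough illustration of "look for them all", here is a minimal Python sketch. The function names `sniff_family` and `read_declared_charset` are made up for this example, the codec names are my own choices (`ascii` stands in for any ASCII-based encoding, and `cp037` is just one EBCDIC code page), and it assumes the file has no byte-order mark and really does begin with SET.

```python
# Byte patterns for the word "SET", paired with a Python codec that can
# decode the rest of the first line. Codec choices are assumptions:
# "ascii" represents every ASCII-based encoding, "cp037" one EBCDIC page.
SET_SIGNATURES = [
    ("utf-32-be", bytes.fromhex("00000053 00000045 00000054")),
    ("utf-32-le", bytes.fromhex("53000000 45000000 54000000")),
    ("utf-16-be", bytes.fromhex("0053 0045 0054")),
    ("utf-16-le", bytes.fromhex("5300 4500 5400")),
    ("cp037", bytes.fromhex("E2 C5 E3")),
    ("ascii", bytes.fromhex("53 45 54")),
]


def sniff_family(head: bytes):
    """Return the codec whose encoding of "SET" starts the data, or None."""
    for codec, signature in SET_SIGNATURES:
        if head.startswith(signature):
            return codec
    return None


def read_declared_charset(data: bytes):
    """Return the charset name that follows SET on the first line, or None."""
    codec = sniff_family(data)
    if codec is None:
        return None
    # Only the first line matters; undecodable bytes after it are replaced.
    text = data[:256].decode(codec, errors="replace")
    parts = text.splitlines()[0].split()
    return parts[1] if len(parts) > 1 else None
```

For example, `read_declared_charset("SET ISO8859-1\n".encode("utf-16-le"))` yields the string `ISO8859-1`, after which the caller can reopen the file with the declared character set.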
Upvotes: 3