Do all character sets have ASCII in common?

Question

The reason I ask is that there's a "standard" for affix files that says to read the first line of a file, and it will tell you how the file is encoded:

The first line specifies the character set used for both the 
wordlist and the affix file (should be all uppercase). 

For example:

SET ISO8859-1

That strikes me as being both unreasonable and unreliable, unless all character sets have the 7-bit ASCII range in common, which would allow you to "taste" up to the first newline byte(s): 0xA or 0xD.

But I have no idea if the ASCII range is common to all character sets or not.

dan04 · Accepted Answer

No. EBCDIC is not ASCII-based, and it is still used in IBM mainframe-based software environments with extreme backwards-compatibility requirements.

More popular are UTF-16 and UTF-32, which, although ASCII-based, are not byte-compatible with ASCII: every character in the ASCII range is padded with extra 00 bytes.

Still, there are only a few ways to encode the Basic Latin alphabet. (What distinguishes most of the hundreds of character encodings that exist is their handling of accented and non-Latin letters.) So the program that reads these files only needs to handle a few possible ways of encoding the word SET:

53 45 54 for ASCII-based encodings (Windows-1252, UTF-8, etc.)
E2 C5 E3 for EBCDIC-based encodings (if these are considered worth supporting at all)
00 53 00 45 00 54 for UTF-16BE
53 00 45 00 54 00 for UTF-16LE
00 00 00 53 00 00 00 45 00 00 00 54 for UTF-32BE
53 00 00 00 45 00 00 00 54 00 00 00 for UTF-32LE

The decoder could simply look for them all.
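
As a rough illustration of "look for them all", here is a minimal Python sketch. The names `SET_SIGNATURES` and `sniff_family` are made up for this example and are not part of any affix-file tooling. It assumes the file starts directly with the word SET, with no byte order mark or leading whitespace.

```python
# Byte patterns for the word "SET", taken from the list above.
# Each entry pairs a signature with the encoding family it implies.
SET_SIGNATURES = [
    (bytes.fromhex("00 00 00 53 00 00 00 45 00 00 00 54"), "UTF-32BE"),
    (bytes.fromhex("53 00 00 00 45 00 00 00 54 00 00 00"), "UTF-32LE"),
    (bytes.fromhex("00 53 00 45 00 54"), "UTF-16BE"),
    (bytes.fromhex("53 00 45 00 54 00"), "UTF-16LE"),
    (bytes.fromhex("53 45 54"), "ASCII-based"),
    (bytes.fromhex("E2 C5 E3"), "EBCDIC-based"),
]


def sniff_family(head: bytes):
    """Return the encoding family implied by the first bytes of a file.

    `head` should hold at least the first 12 bytes of the file.
    Returns None when no known encoding of "SET" matches.
    """
    for signature, family in SET_SIGNATURES:
        if head.startswith(signature):
            return family
    return None


if __name__ == "__main__":
    print(sniff_family("SET ISO8859-1\n".encode("ascii")))
    print(sniff_family("SET UTF-16LE\n".encode("utf-16-le")))
```

Once the family is known, the rest of the first line can be decoded with any encoding from that family to read the declared character set name, and the file can then be reopened with that exact encoding.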
