rianjs

Reputation: 7944

Do all character sets have ASCII in common?

The reason I ask is that there's a "standard" for affix files that says to read the first line of a file, and it will tell you how the file is encoded:

The first line specifies the character set used for both the 
wordlist and the affix file (should be all uppercase). 

For example:

SET ISO8859-1

That strikes me as both unreasonable and unreliable, unless all character sets have the 7-bit ASCII range in common, which would allow you to "taste" the file up to the first newline byte(s): 0x0A or 0x0D.

But I have no idea if the ASCII range is common to all character sets or not.

Upvotes: 1

Views: 129

Answers (1)

dan04

Reputation: 90995

No. EBCDIC is not ASCII-based, and it is still used in IBM mainframe software environments with extreme backwards-compatibility requirements.

More popular are UTF-16 and UTF-32, which, although they share ASCII's code points, are not byte-compatible with ASCII because every ASCII character is padded with extra 00 bytes.

Still, there are only a few ways to encode the Basic Latin alphabet. (What distinguishes most of the hundreds of character encodings that exist is their handling of accented and non-Latin letters.) So, the program that reads these files only needs to handle a few possible ways of encoding the word SET:

  • 53 45 54 for ASCII-based encodings (Windows-1252, UTF-8, etc.)
  • E2 C5 E3 for EBCDIC-based encodings (if these are considered worth supporting at all)
  • 00 53 00 45 00 54 for UTF-16BE
  • 53 00 45 00 54 00 for UTF-16LE
  • 00 00 00 53 00 00 00 45 00 00 00 54 for UTF-32BE
  • 53 00 00 00 45 00 00 00 54 00 00 00 for UTF-32LE

The decoder could simply look for them all.
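As a rough illustration of "look for them all", here is a minimal Python sketch. The function names `sniff_family` and `read_declared_charset` are made up for this example, the codec names are Python's (with `cp037` picked arbitrarily as one EBCDIC code page), and it assumes the file starts directly with SET, with no byte order mark in front of it.

```python
# Byte signatures for the word "SET", taken from the list above.
# The codec on the right is only used to decode the first line;
# "cp037" is one EBCDIC code page chosen for illustration.
SET_SIGNATURES = [
    (bytes.fromhex("00 00 00 53 00 00 00 45 00 00 00 54"), "utf-32-be"),
    (bytes.fromhex("53 00 00 00 45 00 00 00 54 00 00 00"), "utf-32-le"),
    (bytes.fromhex("00 53 00 45 00 54"), "utf-16-be"),
    (bytes.fromhex("53 00 45 00 54 00"), "utf-16-le"),
    (bytes.fromhex("E2 C5 E3"), "cp037"),
    (bytes.fromhex("53 45 54"), "ascii"),
]


def sniff_family(head: bytes):
    """Return a codec able to decode the first line, or None if SET is not found."""
    for signature, codec in SET_SIGNATURES:
        if head.startswith(signature):
            return codec
    return None


def read_declared_charset(data: bytes):
    """Return the charset name declared on the first line, e.g. 'ISO8859-1'."""
    family = sniff_family(data)
    if family is None:
        return None
    # Only the first line matters, so decode a small prefix and replace
    # anything the bootstrap codec cannot handle.
    text = data[:256].decode(family, errors="replace")
    parts = text.splitlines()[0].split()
    return parts[1] if len(parts) > 1 else None


if __name__ == "__main__":
    print(read_declared_charset(b"SET ISO8859-1\nrest of the file"))
```

Once the declared name is known, the whole file can be reopened and decoded with that character set; mapping names such as ISO8859-1 to the decoder's own codec names is left out here.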

Upvotes: 3
