Federico A. Ramponi

Reputation: 47075

How would you design an 8-bit encoding?

How would you design an 8-bit encoding of a set of 256 characters from western languages (say, with the same characters as ISO 8859-1) if it did not have to be backward-compatible with ASCII?

I'm thinking of rules of thumb like these: if ABC...XYZabc...xyz0123...89 were, in this order, the first characters of the set (codes from 0 to 61), then isalpha(c) would just need the comparison c < 52, isalnum(c) would be c < 62, and so on. If, instead, 0123...89 were the first characters, maybe atoi() and the like would be easier to implement.
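A minimal sketch of what those classification functions could look like under the hypothetical layout above (the function names are mine, for illustration; they are not the standard ctype functions):

```c
#include <stdbool.h>

/* Hypothetical layout from the question:
 * codes 0..25 = 'A'..'Z', 26..51 = 'a'..'z', 52..61 = '0'..'9'. */
static bool my_isalpha(unsigned char c) { return c < 52; }
static bool my_isalnum(unsigned char c) { return c < 62; }
static bool my_isdigit(unsigned char c) { return c >= 52 && c < 62; }
```

Each test is a single comparison instead of the two range checks ASCII requires.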

Another idea: if the letters were sorted like AaBbCcDdEeFf... or aàáâãbcdeèéêëfgh..., I think that dictionary-like sorting of strings would be more efficient.

Finally: is there a rationale behind 0 being the terminator of C strings instead of, say, 255?

Upvotes: 1

Views: 291

Answers (5)

peterchen

Reputation: 41106

255 is not a valid character value on a 7-bit system, or might be somewhere in the middle of the native character set on a 9-bit machine. Imagine the native 'e' being your string terminator.

So it's historic: "Can it run on a toaster chip?" was a fundamental (if retrofitted) design principle for C. Type widths are rather weakly defined in C, so implementations could use "native" elements - char being "the smallest individually addressable element" - and that wasn't, and isn't, 8 bits on all machines. 0 was widely unused anyway.

For the rest of your question: entirely subjective, depending on what you optimize for. It makes sense only in very strictly defined environments that are very low on resources. E.g. in German, there are different "phone book" and "dictionary" sort rules. Which do you pick?


In the light of your examples, I'd put digits first, followed by letters (easier for dec/hex strings). I'd keep uppercase and lowercase letters apart - but, as in ASCII, separated by a single bit. Instead of cramming the set full of funny characters, I'd rather leave some codes undefined so that these tricks work better. Optimizing for sorting is pointless unless you pre-define the sort algorithm.
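The single-bit case distinction could look like this. The concrete layout here (digits at 0x00..0x09, upper-case letters at 0x40..0x59, lower-case at 0x60..0x79) is my own assumption for illustration; the answer only proposes the general idea:

```c
/* Assumed layout: digits at 0x00..0x09, 'A'..'Z' at 0x40..0x59,
 * 'a'..'z' at 0x60..0x79 -- the two cases differ only in bit 0x20,
 * the same trick ASCII uses. */
#define CASE_BIT 0x20u

static int is_letter(unsigned char c) {
    unsigned char folded = c & (unsigned char)~CASE_BIT;  /* fold to upper-case range */
    return folded >= 0x40 && folded <= 0x59;
}

static unsigned char to_upper(unsigned char c) {
    return is_letter(c) ? (unsigned char)(c & ~CASE_BIT) : c;  /* clear the case bit */
}

static unsigned char to_lower(unsigned char c) {
    return is_letter(c) ? (unsigned char)(c | CASE_BIT) : c;   /* set the case bit */
}
```

Case conversion becomes a single bit operation, and non-letters pass through unchanged.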

Upvotes: 2

Roger Pate

Reputation:

You cannot realistically design a character set without considering backwards compatibility.

To discard backwards compatibility, you must have an awesome reason, and that backwards compatibility effectively means ASCII compatibility. Such a reason is going to be extremely difficult to formulate in today's interconnected world where so many charsets (either weighted by use or not) maintain it. This is going to limit you to highly-specialized embedded environments.

Let's imagine one of those environments: a microwave oven. It has to display numbers as well as letters; things like "popcorn", "1 oz", "1.2 oz" (popcorn bag sizes), and so forth. It does absolutely no communication with any other device. It has no inherent need for any control codes (imagine a single-line LCD display: even newline is meaningless). We can even say you're selling this microwave only in areas that speak English and selecting a different UI language is a complete non-issue.

Even then, staying ASCII compatible has very nice benefits with minimal downsides. For example, you can test production code inside software-emulated hardware and still use common debuggers.

Toss out many letters you never use and only use uppercase (or only lowercase), numbers, and minimal punctuation (space, period). That leaves you needing no more than 5 bits in a minimal scheme. Maybe fewer if you toss out more letters, but it will be hard to get down to only 4 letters so you can stay within 4 bits (4 bits = 16 values; 16 - 10 digits - 2 punctuation = 4 letters).

But it's not like the commodity hardware you'd use, in today's reality, is going to notice the difference between 40 bits (8x 5-bit chars) and 64 bits (8x 8-bit chars), and that's assuming you can even find commodity hardware that lets you shave bits like this.
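To make the 40-bit figure concrete, here is one way to pack 5-bit codes into bytes. The MSB-first layout is my choice; nothing in the answer specifies one:

```c
#include <stdint.h>
#include <stddef.h>

/* Pack n 5-bit codes (values 0..31) into out, MSB-first.
 * Returns the number of bytes written: (n * 5 + 7) / 8. */
static size_t pack5(const uint8_t *codes, size_t n, uint8_t *out) {
    size_t nbytes = (n * 5 + 7) / 8;
    for (size_t i = 0; i < nbytes; i++)
        out[i] = 0;

    size_t bitpos = 0;                       /* bit offset from the start, MSB-first */
    for (size_t i = 0; i < n; i++) {
        uint16_t v = (uint16_t)(codes[i] & 0x1F);
        size_t byte = bitpos / 8, shift = bitpos % 8;
        out[byte] |= (uint8_t)((v << 3) >> shift);       /* high part of the code */
        if (shift > 3)                                   /* code straddles two bytes */
            out[byte + 1] |= (uint8_t)(v << (11 - shift));
        bitpos += 5;
    }
    return nbytes;
}
```

Eight 5-bit characters land in exactly 5 bytes, versus 8 bytes unpacked - the 40-vs-64-bit comparison above.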

Upvotes: 2

James Anderson

Reputation: 27478

If I were doing this from scratch I would have the following scheme:

x00 -- x10  -- Control characters such as end of file, end of line, end of string.

x10 -- x30  -- Alphabetic characters using the following pattern:
    x10  -> A  Upper case A.
    x11  -> a  Lower case A.
    x12  -> a with first local accent, e.g. a acute.
    x13  -> a with second local accent, e.g. a grave.
    .....................
x40 -- x50  -- Local "extra" characters.
    Things like the Scandinavian AE or Danish /O which are regarded as separate
    characters with their own position in the collating scheme.

x50 -- x60 -- Punctuation .,:; etc.
x70 -- x80 -- Other special characters {}/\ etc.

xF0 -- xFF -- 0 to 9

There would be a number of advantages to this scheme (none of which are worth the pain of implementation and conversion!).

Firstly, isnumeric, isalpha, etc. can be implemented with a simple bit mask.

Secondly, collating would automatically fall into a natural sequence:

Ale, alcohol, ácute, áccentgrave, Beer, Øl
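Taking the table above at face value (letters at 0x10..0x2F, digits at 0xF0..0xFF), the first advantage might look like this; the function names are mine:

```c
/* Sketch of bit-mask classification under the hypothetical scheme above:
 * letters occupy 0x10..0x2F, digits occupy 0xF0..0xFF. */
static int scheme_isnumeric(unsigned char c) {
    return (c & 0xF0) == 0xF0;                   /* one mask test: top nibble all ones */
}

static int scheme_isalpha(unsigned char c) {
    /* 0x10 <= c <= 0x2F, i.e. (c - 0x10) fits in the low 5 bits */
    return ((unsigned char)(c - 0x10) & 0xE0) == 0;
}
```

The digit test in particular needs no range comparison at all, just one AND and one compare.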

However, fitting a complex multicultural world into an eight-bit scheme is just not possible, and any scheme proposed would be compromised somehow. The real solution is to listen to the good folks at the Unicode Consortium, who have all the bases covered by simply using 16 bits (or more!).

Upvotes: 2

Philip Potter

Reputation: 9135

What problems do you see with existing character sets that you hope to solve with a new one?

The efficiency savings of only needing c < 52 rather than c > M && c < N are marginal at best, given that this is rarely a bottleneck. Moreover, isalpha() and isalnum() are locale-specific and need to take care of accented characters, so in locales other than the one you design the charset for, you don't get any savings at all.

Your second idea of aàáâãbcdeèéêëfgh... is nice for ordering single characters according to a particular locale, but it doesn't help ordering multicharacter strings in languages where some characters are equivalent with respect to ordering. For example in German dictionaries umlauts are ignored for ordering purposes (abc < äbd < abe) so you still couldn't do a simple lexicographic order of char values.

Upvotes: 1

zneak

Reputation: 138171

I wouldn't design an 8-bit encoding. That's dumb. There are far more than 256 characters in human writing systems.

However, if I could just have a remake of the ANSI character set, I'd remove all the now-defunct control characters that span from 1 to 31. The rest is pretty much okay in my opinion. You also have to take into account how strings are sorted (like how a string starting with an underscore should sort before a string starting with a digit).

That being said, the rationale for making 0 the string terminator is probably that 0 means false in a condition, so you can iterate through a string just by checking if the character is non-zero, like if(*string) instead of if(*string != 0xFF).
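A sketch of why the 0 terminator reads so naturally in C (my_strlen is just an illustrative reimplementation of the standard function):

```c
#include <stddef.h>

/* With 0 as the terminator, the loop condition is the character itself. */
static size_t my_strlen(const char *s) {
    const char *p = s;
    while (*p)          /* reads as "while not at end of string" */
        p++;
    return (size_t)(p - s);
}
```

With 255 as the terminator, the condition would have to be an explicit comparison against 0xFF on every iteration.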

Also, community wiki.

Upvotes: 2
