Reputation: 11399
I was just parsing the following website.
There one finds the text
und wären damit auch
At first, the "ä" looks perfectly fine, but once I inspect it, it turns out that this is not the regular "ä" (represented as ascw 228) but this:
ascw: 97, char: a
ascw: 776, char: ¨
I have never before seen an "ä" represented like this.
How can it happen that a website uses this weird character combination, and what might be the benefit of it?
Upvotes: 4
Views: 1108
Reputation: 1
The use of the word "diaeresis" to describe the correctly characterised "umlaut" is wrong. A diaeresis is found on the second of two vowels placed together to indicate that they should be pronounced separately. An umlaut is placed over a single letter to indicate that it is pronounced differently (in German "um" = change and "laut" = sound) as though it were combined with an "e".
Upvotes: -1
Reputation: 439
Oh my, this was the answer to my original problem with the name of a file upload.
Cannot convert argument 2 to ByteString because the character at index 6 has value 776 which is greater than 255
For future reference.
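A minimal sketch of one possible workaround, assuming the failing value is a file name that contains a decomposed umlaut. The file name and the helper `to_composed` are hypothetical, and Python's Latin-1 encoder merely stands in for the "no value greater than 255" restriction from the error message:

```python
import unicodedata

def to_composed(name):
    """Return the NFC form, so 'a' + U+0308 becomes the single code point U+00E4."""
    return unicodedata.normalize("NFC", name)

# Hypothetical file name containing 'a' followed by the combining diaeresis (776).
filename = "Gera\u0308t.txt"

# Before normalizing, one character is above 255 and cannot be stored in a byte.
print(max(ord(ch) for ch in filename))

# After normalizing, every character fits into one byte (Latin-1 range).
safe = to_composed(filename)
print(max(ord(ch) for ch in safe))
print(safe.encode("latin-1"))
```

This only helps when a precomposed code point exists and lies at or below 255; characters outside that range would still need some other encoding step.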
Upvotes: 0
Reputation: 78945
What you don't mention in your question is the encoding used. Quite obviously it is a Unicode-based encoding.
In Unicode, code point U+0308 (776 in decimal) is the combining diaeresis. The letter a followed by the combining diaeresis produces the German character ä.
There are indeed two ways to represent German characters with umlauts (ä in this case). Either as a single code point:
U+00E4 latin small letter A with diaeresis
Or as a sequence of two code points:
U+0061 latin small letter A
U+0308 combining diaeresis
Similarly you would combine two code points for an upper case 'Ä':
U+0041 latin capital letter A
U+0308 combining diaeresis
Unicode generally favors the two-code-point approach, since combining marks let it cover a wide range of characters with diacritics using fewer code points. However, for historical reasons, dedicated code points exist for letters with German umlauts and French accents.
The Unicode libraries in most programming languages provide functions to normalize a string, i.e. to either convert all sequences into a single code point where possible (NFC) or expand all single code points into the two-code-point sequence (NFD). Also see Unicode Normalization Forms.
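As an illustration, here is a short sketch in Python, where `unicodedata.normalize` performs the conversion. "NFC" is the composed form and "NFD" the decomposed one. The sample text is the phrase from the question, written with the decomposed "ä":

```python
import unicodedata

# Phrase from the question, with "ä" stored as 'a' + U+0308.
text = "und wa\u0308ren damit auch"

# NFC: combine sequences into a single code point where one exists.
composed = unicodedata.normalize("NFC", text)

# NFD: expand precomposed code points back into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", composed)

print(len(text), len(composed), len(decomposed))
print(composed == "und w\u00e4ren damit auch")
print(decomposed == text)
```

Normalizing both sides to the same form before comparing or searching is a common way to make the two spellings of "ä" match.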
Upvotes: 6