When is an umlaut not an umlaut (u-umlaut maps to charCode 117)

Question

> let uValFromStr  = "Würzburg".charCodeAt(1)
undefined
> uValFromStr
117
> String.fromCharCode(252)
'ü'
> "Würzburg".charCodeAt(1) === String.fromCharCode(252)
false
>

We have a case where an umlaut in a string fails a simple string comparison because its value is actually mapped to charCode 117. The u-umlaut should be mapped to charCode 252. Note the first two lines, where we extract the character's charCode. When this occurs, a user enters a text string matching the first three characters and the match fails because the code is evaluating 117 === 252.

Any ideas as to how this can occur? We have numerous use cases with umlauts in our data which work correctly so it is not an endemic issue but rather one that is particular to this input only (so far).

T.J. Crowder · Accepted Answer

The ü in that specific "Würzburg" string is written using the Unicode code point for u (U+0075) followed by an umlaut combining mark (U+0308) which modifies it, but the ü you're comparing it to is written with the single Unicode code point for u-with-umlaut (U+00FC). Nearly all of JavaScript's string handling is quite naive, which is why they aren't equal. This naive (but fast!) nature has two parts:

1) It doesn't know about combining marks, which is why "Würzburg".length is 9 instead of 8 (if the ü is written using U+0075 and U+0308).

2) JavaScript "characters" are actually UTF-16 code units, which may be only half of a code point ("😊".length is 2, for instance, because although it's a single Unicode code point (U+1F60A), it requires two code units to be expressed in UTF-16).

(One can argue that JavaScript strings are UCS-2 because they tolerate invalid surrogate pairs [pairs of code units that, taken together, describe a code point], but the spec says "...each element in the String is treated as a UTF-16 code unit value...")
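As a minimal sketch of both points, the two spellings can be written with escape sequences so the code points are unambiguous regardless of how an editor or clipboard normalizes the text. The variable names here are illustrative and not taken from the question.

```javascript
// Two spellings of the same visible word; escapes make the code points explicit.
const decomposed = "Wu\u0308rzburg";  // u (U+0075) + combining mark (U+0308)
const precomposed = "W\u00FCrzburg";  // single code point for ü (U+00FC)

console.log(decomposed.length);          // 9
console.log(precomposed.length);         // 8
console.log(decomposed.charCodeAt(1));   // 117 (plain "u")
console.log(decomposed.charCodeAt(2));   // 776 (0x0308, the combining mark)
console.log(precomposed.charCodeAt(1));  // 252
console.log(decomposed === precomposed); // false

// Code units vs. code points: U+1F60A needs two UTF-16 code units.
const smiley = "\u{1F60A}";
console.log(smiley.length);                      // 2 (code units)
console.log([...smiley].length);                 // 1 (code points)
console.log(smiley.codePointAt(0).toString(16)); // "1f60a"
```

This matches what the question's REPL session shows: charCodeAt(1) returns 117 because position 1 holds the plain u, and the combining mark sits at position 2.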

You can solve the problem of comparing those two umlauted u's by normalizing both sides first, via JavaScript's normalize method (added in ES2015):

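// Written with escapes so the decomposed form from the question survives copy/paste:
// u (U+0075) followed by the combining mark (U+0308)
const word = "Wu\u0308rzburg";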

// Iteration moves through the string by code points, not code units
for (const ch of word) {
    console.log(`${ch} = ${ch.codePointAt(0)}`);
}

const char = String.fromCharCode(252);

const normalizedWord = word.normalize();
const normalizedChar = char.normalize();

// Using iteration to grab the second "character" (code point) from the string
const [, secondCharOfWord] = normalizedWord;

console.log(normalizedChar === secondCharOfWord); // true


In that example we use the default normalization ("NFC," Normalization Form C), which prefers specific code points to combining marks, so the normalized version of the word uses u-with-umlaut code point U+00FC. There are other normalization forms available by passing an argument to normalize (such as Normalization Form D, which prefers combining marks to specific character code points), but the default is usually the one you want.
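To make the difference between the two forms concrete, here is a small sketch that normalizes the decomposed spelling both ways and then applies it to the prefix-matching case described in the question. The helper name startsWithNormalized is hypothetical, and the assumption that the user's input arrives in precomposed form is mine, not something the question states.

```javascript
const decomposedWord = "Wu\u0308rzburg"; // u + combining mark, as in the question

const nfc = decomposedWord.normalize("NFC"); // prefers precomposed code points
const nfd = decomposedWord.normalize("NFD"); // prefers base letter + combining marks

console.log(nfc.length);        // 8
console.log(nfd.length);        // 9
console.log(nfc.charCodeAt(1)); // 252
console.log(nfd.charCodeAt(1)); // 117

// Hypothetical helper: normalize both sides to the same form before comparing.
function startsWithNormalized(text, prefix) {
    return text.normalize("NFC").startsWith(prefix.normalize("NFC"));
}

const userInput = "W\u00FCr"; // assumed: the user typed a precomposed ü

console.log(decomposedWord.startsWith(userInput));            // false
console.log(startsWithNormalized(decomposedWord, userInput)); // true
```

Which form you pick matters less than applying the same one to both sides. If the data comes from several sources, normalizing once when it enters the system may be simpler than normalizing at every comparison.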
