JavaScript encoding issue with accented characters

Question

I have a page with a UTF-8 header:

In the page I use the Umbraco dictionary to fetch content in various languages. When I print this in German on the page, it appears fine:

@library.GetDictionaryItem("A")

resolves to:

`Ä`

in German

However if I enter it via a script:

The alert prints:

&#228;

If I do

The alert prints:

Ä

So what could explain this behaviour, and how can I fix the alert? As far as I can see everything is UTF-8, and the dictionary and the page encoding are fine. The problem happens within JavaScript.

From what I can see from the table here, JavaScript resolves the character into its numeric value. I tried "escape, encodeUrl, decodeUrl" etc. with no luck.

chr  HexCode  Numeric  HTML entity  escape(chr)  encodeURI(chr)
ä    \xE4     &#228;   &auml;       %E4          %C3%A4

T.J. Crowder · Accepted Answer

(FWIW: Character entity &#228; is ä, not Ä.)

This has nothing to do with character encoding. You're outputting an HTML entity to a JavaScript string, and then asking the browser to display that JavaScript string without doing anything to interpret HTML (via alert). It's exactly as though you actually typed:

&#228;

...(which will show ä on the page), and

...which won't. The HTML entity isn't being used anywhere that understands HTML entities. alert doesn't interpret HTML.
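A minimal sketch of what that means in practice, assuming the dictionary item reaches the script as the text &#228; (the variable names here are made up for illustration): to JavaScript the entity is just six ordinary characters, not the one character it stands for in HTML.

```javascript
// The entity as it arrives in the script: plain text, not markup.
var entityText = "&#228;";
// The single character the entity stands for in an HTML context.
var expected = "\u00E4"; // ä

var entityLength = entityText.length;      // 6: the characters & # 2 2 8 ;
var expectedLength = expected.length;      // 1
var sameString = entityText === expected;  // false: nothing has decoded the entity
```

Passing entityText to alert therefore shows those six characters verbatim.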

But if you did this:

...you'd see the character on the page, because we're giving the entity to something (innerHTML) that will interpret HTML. And so if you make that first line:

var a = "@library.GetDictionaryItem("A")";

...and then use a in an HTML context (as above), you'll get the ä in the document.

If you always get a decimal numeric character entity (like &#228;) from Umbraco, you can parse the entity easily enough, since those entities name Unicode code points and JavaScript (mostly) uses Unicode code points in its strings*:

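function characterFromDecimalNumericEntity(str) {
    // Matches a string that is exactly one decimal numeric entity, e.g. "&#228;"
    var decNumEntRex = /^&#(\d+);$/;
    var match = decNumEntRex.exec(str);
    var codepoint = match ? parseInt(match[1], 10) : null;
    // String.fromCharCode produces a single UTF-16 code unit
    var character = codepoint !== null ? String.fromCharCode(codepoint) : null;
    return character; // null if str was not a decimal numeric entity
}
alert(characterFromDecimalNumericEntity("&#228;")); // ä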


* Why "mostly": JavaScript strings are made up of 16-bit "characters" that correspond to UTF-16 code units, not Unicode code points (you can't store a Unicode code point in 16 bits, you need 21). All characters from the Basic Multilingual Plane fit within one UTF-16 code unit, but characters from the Supplementary Multilingual Plane, Supplementary Ideographic Plane, and so on require two UTF-16 code units for a character. One of those characters will occupy two "characters" in a JavaScript string. The function above would fail for them. More in the JavaScript spec and the Unicode FAQ.
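One possible way around that limitation, sketched under the assumption that String.fromCodePoint (ES2015) is available in the target browsers: the helper below is hypothetical, not part of the original answer. It replaces every decimal numeric entity found anywhere in a string, so it also copes with whole words rather than a single entity.

```javascript
// Hypothetical helper (not from the original answer): replaces every
// decimal numeric entity in a string with the character it names.
function decodeDecimalNumericEntities(str) {
    return str.replace(/&#(\d+);/g, function (entity, digits) {
        var codePoint = parseInt(digits, 10);
        // Leave anything beyond the Unicode range untouched;
        // String.fromCodePoint would throw a RangeError for it.
        if (codePoint > 0x10FFFF) {
            return entity;
        }
        // Emits a surrogate pair when the code point needs two code units.
        return String.fromCodePoint(codePoint);
    });
}

var umlaut = decodeDecimalNumericEntities("&#228;");      // "ä"
var word = decodeDecimalNumericEntities("M&#228;dchen");  // "Mädchen"
var emoji = decodeDecimalNumericEntities("&#128512;");    // U+1F600, two UTF-16 code units
```

This only handles decimal entities; hexadecimal (&#xE4;) and named (&auml;) entities would pass through unchanged, so using an HTML context such as innerHTML remains the more general option.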
