Nick

Reputation: 2907

JavaScript encoding issue with accented characters

I have a page with UTF-8 header:

<meta charset="utf-8" />

In the page I use the Umbraco dictionary to fetch content in various languages. When I print a dictionary item in German on the page, it appears fine:

<h1>@library.GetDictionaryItem("A")</h1>

resolves to:

<h1>Ä</h1> in German

However if I enter it via a script:

<script type="text/javascript" charset="utf-8">
  var a = "@library.GetDictionaryItem("A")";
  alert(a);
</script>

The alert prints:

&#228;

If I do

<script type="text/javascript" charset="utf-8">
  var a = "Ä";
  alert(a);
</script>

The alert prints:

Ä

So what could explain this behaviour, and how can I fix the alert? As far as I can see everything is UTF-8, and the dictionary and the page encoding are fine. The problem happens within JavaScript.

From what I can see from the table here, JavaScript resolves the character into its numeric value. I tried escape, encodeURI, decodeURI, etc. with no luck.

chr  HexCode  Numeric  HTML entity  escape(chr)  encodeURI(chr)
ä    \xE4     &#228;   &auml;       %E4          %C3%A4

Upvotes: 1

Views: 11957

Answers (1)

T.J. Crowder

Reputation: 1074028

(FWIW: Character entity &#228; is ä, not Ä.)

This has nothing to do with character encoding. You're outputting an HTML entity to a JavaScript string, and then asking the browser to display that JavaScript string without doing anything to interpret HTML (via alert). It's exactly as though you actually typed:

<h1>&#228;</h1>

...(which will show ä on the page), and

<script>
var a = "&#228;";
alert(a);
</script>

...which won't. The HTML entity isn't being used anywhere that understands HTML entities. alert doesn't interpret HTML.
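
As a rough illustration (a sketch you can paste into a console, not something from the original page): to JavaScript, the entity is just six ordinary characters, while the character it names is a single one. The variable names here are only for the example.

```javascript
// The HTML entity, as JavaScript sees it: six plain characters
var entity = "&#228;";
// The character the entity stands for, written as a JavaScript escape
var character = "\u00E4";

console.log(entity.length);        // 6
console.log(character.length);     // 1
console.log(entity === character); // false
```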

But if you did this:

<script>
var a = "&#228;";
var div = document.createElement('div');
div.innerHTML = a;
document.body.appendChild(div);
</script>

...you'd see the character on the page, because we're giving the entity to something (innerHTML) that will interpret HTML. And so if you make that first line:

var a = "@library.GetDictionaryItem("A")";

...and then use a in an HTML context (as above), you'll get the ä in the document.

If you always get a decimal numeric character entity (like &#228;) from Umbraco, you can parse the entity easily enough, since those entities define Unicode code points and JavaScript (mostly) uses Unicode code points in its strings*:

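function characterFromDecimalNumericEntity(str) {
    // Match a string that is exactly one decimal numeric entity, e.g. "&#228;"
    var decNumEntRex = /^&#(\d+);$/;
    var match = decNumEntRex.exec(str);
    if (!match) {
        return null; // not a decimal numeric entity
    }
    var codepoint = parseInt(match[1], 10);
    // String.fromCharCode takes a UTF-16 code unit, so this is only
    // correct for code points up to 0xFFFF (see the note below)
    return String.fromCharCode(codepoint);
}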
alert(characterFromDecimalNumericEntity("&#228;")); // ä
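
If the dictionary item is a whole word or phrase rather than a single entity, the same idea can be applied to every decimal numeric entity in the string. This is a minimal sketch: decodeDecimalEntities is a hypothetical name, and it assumes only decimal numeric entities occur (named entities such as &auml; and hexadecimal ones are left untouched).

```javascript
// Hypothetical helper: replaces every decimal numeric entity in a string.
// Assumes only entities of the form &#NNN; need decoding.
function decodeDecimalEntities(str) {
    return str.replace(/&#(\d+);/g, function (match, digits) {
        // Same limitation as the single-entity function: BMP characters only
        return String.fromCharCode(parseInt(digits, 10));
    });
}

console.log(decodeDecimalEntities("Gr&#252;&#223;e")); // Grüße
```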


* Why "mostly": JavaScript strings are made up of 16-bit "characters" that correspond to UTF-16 code units, not Unicode code points (you can't store a Unicode code point in 16 bits, you need 21). All characters from the Basic Multilingual Plane fit within one UTF-16 code unit, but characters from the Supplementary Multilingual Plane, Supplementary Ideographic Plane, and so on require two UTF-16 code units for a character. One of those characters will occupy two "characters" in a JavaScript string. The function above would fail for them. More in the JavaScript spec and the Unicode FAQ.
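
To make that concrete, here is a small sketch using a code point outside the Basic Multilingual Plane (U+1F600, decimal 128512, chosen purely for illustration). It assumes an ES2015+ engine, where String.fromCodePoint is available as an alternative to String.fromCharCode.

```javascript
var codepoint = 128512; // U+1F600, outside the Basic Multilingual Plane

// fromCharCode keeps only the low 16 bits, producing U+F600 instead
var truncated = String.fromCharCode(codepoint);
console.log(truncated.charCodeAt(0).toString(16)); // f600

// fromCodePoint emits the surrogate pair for the full code point
var full = String.fromCodePoint(codepoint);
console.log(full.length);                       // 2
console.log(full === "\uD83D\uDE00");           // true
console.log(full.codePointAt(0) === codepoint); // true
```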

Upvotes: 3
