user13009077

Reputation: 1

Unicode not displaying correctly using JavaScript

convert.onclick =
  function() {
    for (var i = 0; i < before.value.length; i++) {
      after.value += "'" + before.value.charAt(i) + "', ";
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟑(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

Here is a simple piece of code. When I run it, I get the following result.

Problem

Some characters are converted correctly, but most of the Unicode characters are not displayed properly. How do I fix this?

Upvotes: 0

Views: 314

Answers (3)

connexo

Reputation: 56770

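That is because JavaScript strings are sequences of UTF-16 code units. Characters outside the Basic Multilingual Plane (code points above U+FFFF) take two code units, so their length is 2 and charAt returns only one half of the character.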

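console.log("9".length);  // 1: a single UTF-16 code unit
console.log("𝟑".length);  // 2: a surrogate pair

console.log("𝟑".charAt(0));  // only the first half of the pair, not a displayable character
console.log(String.fromCodePoint("𝟑".codePointAt(0)));  // the whole character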

To fix it, iterate with for...of, which yields whole code points, and use String.fromCodePoint and codePointAt instead of charAt:

convert.onclick =
  function() {
    for (const char of before.value) {
      after.value += `'${String.fromCodePoint(char.codePointAt(0))}', `;
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟑(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

You can also do an index-based traversal, but that requires advancing the index variable inside the loop by the length of the character currently being traversed:

convert.onclick =
  function() {
    for (let i = 0; i < before.value.length; ) {
      after.value += `'${String.fromCodePoint(before.value.codePointAt(i))}', `;
      i+= String.fromCodePoint(before.value.codePointAt(i)).length;
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟑(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>
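
The index-based traversal can also be written as a standalone function, which makes it easier to check without a page. This is only a sketch: the name splitByCodePoint is hypothetical (it is not part of the answer above), and it assumes that a code point above U+FFFF always occupies exactly two UTF-16 code units, which is how JavaScript strings are encoded.

```javascript
// Hypothetical helper: split a string into whole code points using an index.
function splitByCodePoint(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(String.fromCodePoint(cp));
    // Code points above U+FFFF are stored as two code units, so skip both.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

// "\u{1D7D1}" is the escape form of the bold digit three from the question.
const parts = splitByCodePoint("*\u{1D7D1}(");
const quoted = parts.map(c => `'${c}'`).join(", ");
console.log(quoted);
```

Comparing the code point against 0xFFFF avoids building a temporary string just to measure its length.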

Upvotes: 0

Klaycon

Reputation: 11060

What you're running into are called surrogate pairs. JavaScript strings are UTF-16, and some Unicode characters are encoded as two 16-bit code units instead of one. If you separate the two halves, they no longer display correctly.
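
As a rough illustration of what a surrogate pair looks like, the sketch below takes the first non-ASCII character from the question (the bold digit three, U+1D7D1), reads its two halves with charCodeAt, and recombines them with the standard UTF-16 formula. The variable names are only for illustration.

```javascript
// "\uD835\uDFD1" is the bold digit three (U+1D7D1) written as two escapes.
var s = "\uD835\uDFD1";

var high = s.charCodeAt(0); // 0xD835, in the high surrogate range 0xD800-0xDBFF
var low = s.charCodeAt(1);  // 0xDFD1, in the low surrogate range 0xDC00-0xDFFF

// Recombine the two halves into the original code point.
var codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;

console.log(s.length);               // 2
console.log(codePoint.toString(16)); // 1d7d1
```

Taking charAt(0) or charAt(1) alone yields one of these halves, which is not a valid character by itself.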

If you can use ES6, iterating a string with the spread operator or for...of takes surrogate pairs into account and gives you correct results more easily. Other answers show how to do this.

If you can't use ES6, MDN has an example of how to handle these with charAt. I'll use that code below.

function getWholeChar(str, i) {
  var code = str.charCodeAt(i);

  if (isNaN(code)) return ''; // index out of range; global isNaN also works before ES6
  if (code < 0xD800 || code > 0xDFFF) return str.charAt(i);

  if (0xD800 <= code && code <= 0xDBFF) {
    if (str.length <= (i + 1)) throw 'High surrogate without following low surrogate';
    var next = str.charCodeAt(i + 1);
    if (0xDC00 > next || next > 0xDFFF) throw 'High surrogate without following low surrogate';
    return str.charAt(i) + str.charAt(i + 1);
  }
  
  if (i === 0) throw 'Low surrogate without preceding high surrogate';
  var prev = str.charCodeAt(i - 1);

  if (0xD800 > prev || prev > 0xDBFF) throw 'Low surrogate without preceding high surrogate';
  return false;
}

convert.onclick =
  function() {
    for (var i = 0, chr; i < before.value.length; i++) {
      if(!(chr = getWholeChar(before.value, i))) continue;
      after.value += "'" + chr + "', ";
    }
  }
<textarea id="before" type="text" name="input" style="width:100%;">*𝟑(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

Upvotes: 2

ponury-kostek

Reputation: 8060

You can use the spread operator (...) to create an array of whole Unicode characters (code points):

convert.onclick = function () {
	after.value = [...before.value].map(s => `'${s}'`).join(",");
};
<textarea id="before" type="text" name="input" style="width:100%;">*𝟑(𝟘)-_=+𝕢ℚ𝕨𝕎𝕖𝔼𝕣ℝ𝕥𝕋𝕪𝕐𝕦𝕌</textarea><br />
<textarea id="after" cols="50" rows="10" name="output" style="width:100%;"></textarea>
<button id="convert" name="convert" type="button">convert</button>

Upvotes: 1
