Alexis Delrieu

Reputation: 1513

How to fix the encoding of a string in JavaScript

I received a broken string from another piece of software. I would like to fix its encoding in JavaScript, but I feel I am missing something.

Here's an example of a broken string: DÃ©tectÃ© Ã lors Ã´Ã¹
And the expected output would be: Détecté àlors ôùi

I don't know the encoding used to send me the string.

My idea is to use the TextDecoder API: convert the string to bytes, and then re-encode it in UTF-8 or UTF-16.

Here's the piece of code I used to detect the charset used:

const str = 'Détecté àlors ôùi';
const str2 = 'DÃ©tectÃ© Ã lors Ã´Ã¹';

const charsets = [
  'utf-8',
  "ibm866",
  "iso-8859-2",
  "iso-8859-3",
  "iso-8859-4",
  "iso-8859-5",
  "iso-8859-6",
  "iso-8859-7",
  "iso-8859-8",
  "iso-8859-8-i",
  "iso-8859-10",
  "iso-8859-13",
  "iso-8859-14",
  "iso-8859-15",
  "iso-8859-16",
  "koi8-r",
  "koi8-u",
  "macintosh",
  "windows-874",
  "windows-1250",
  "windows-1251",
  "windows-1252",
  "windows-1253",
  "windows-1254",
  "windows-1255",
  "windows-1256",
  "windows-1257",
  "windows-1258",
  "x-mac-cyrillic",
  "gbk",
  "gb18030",
  "hz-gb-2312",
  "big5",
  "euc-jp",
  "iso-2022-jp",
  "shift-jis",
  "euc-kr",
  "iso-2022-kr",
  "utf-16be",
  "utf-16le",
  "iso-2022-cn"
];

const encoder = new TextEncoder();
const view = encoder.encode(str2);

console.log('__________________')

charsets.forEach((charset) => {
  try {
    const decoder = new TextDecoder(charset);
    const fixedStr = decoder.decode(view, {
      fatal: false,
      ignoreBOM: true,
    });

    console.log(charset, fixedStr);
  } catch (e) {
    console.log(charset, 'invalid');
  }
})

(the code can be tested here: https://jsfiddle.net/tashebwj/ )

The output is the following:

__________________
?editor_console=true:57 utf-8 DÃ©tectÃ© Ã lors Ã´Ã¹
?editor_console=true:57 ibm866 D├Г┬йtect├Г┬й ├Г┬аlors ├Г┬┤├Г┬╣
?editor_console=true:57 iso-8859-2 DĂŠtectĂŠ Ă lors Ă´Ăš
?editor_console=true:57 iso-8859-3 D�Âİtect�Âİ � lors �´�Âı
?editor_console=true:57 iso-8859-4 DÊtectÊ àlors ôÚ
?editor_console=true:57 iso-8859-5 DУТЉtectУТЉ УТ lors УТДУТЙ
?editor_console=true:57 iso-8859-6 Dأآ�tectأآ� أآ lors أآ�أآ�
?editor_console=true:57 iso-8859-7 DΓΒ©tectΓΒ© ΓΒ lors ΓΒ΄ΓΒΉ
?editor_console=true:57 iso-8859-8 D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-8-i D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-10 DÃÂĐtectÃÂĐ Ã lors ÃÂīÃÂđ
?editor_console=true:57 iso-8859-13 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ“ĆĀ¹
?editor_console=true:57 iso-8859-14 Détecté àlors ÃÂṀÃÂṗ
?editor_console=true:57 iso-8859-15 Détecté àlors ÃŽù
?editor_console=true:57 iso-8859-16 DĂ©tectĂ© Ă lors ĂÂŽĂÂč
?editor_console=true:57 koi8-r Dц┐б╘tectц┐б╘ ц┐б═lors ц┐б╢ц┐б╧
?editor_console=true:57 koi8-u Dц┐б╘tectц┐б╘ ц┐б═lors ц┐бЄц┐б╧
?editor_console=true:57 macintosh Détecté àlors ôù
?editor_console=true:57 windows-874 Dรยฉtectรยฉ รย lors รยดรยน
?editor_console=true:57 windows-1250 DĂ©tectĂ© Ă lors Ă´ĂÂą
?editor_console=true:57 windows-1251 DГѓВ©tectГѓВ© ГѓВ lors ГѓВґГѓВ№
?editor_console=true:57 windows-1252 Détecté àlors ôù
?editor_console=true:57 windows-1253 Détecté àlors ôù
?editor_console=true:57 windows-1254 Détecté àlors ôù
?editor_console=true:57 windows-1255 Dֳƒֲ©tectֳƒֲ© ֳƒֲ lors ֳƒֲ´ֳƒֲ¹
?editor_console=true:57 windows-1256 Dأƒآ©tectأƒآ© أƒآ lors أƒآ´أƒآ¹
?editor_console=true:57 windows-1257 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ´ĆĀ¹
?editor_console=true:57 windows-1258 DĂƒÂ©tectĂƒÂ© ĂƒÂ lors ĂƒÂ´ĂƒÂ¹
?editor_console=true:57 x-mac-cyrillic D√Г¬©tect√Г¬© √Г¬†lors √Г¬і√Г¬є
?editor_console=true:57 gbk D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 gb18030 D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 hz-gb-2312 invalid
?editor_console=true:57 big5 D�穢tect�穢 ��饊ors �織�繒
?editor_console=true:57 euc-jp D�息tect�息 ��lors �卒�孫
?editor_console=true:57 iso-2022-jp D����tect���� ����lors ��������
?editor_console=true:57 shift-jis Dテδゥtectテδゥ テδ�lors テδエテδケ
?editor_console=true:57 euc-kr D횄짤tect횄짤 횄혻lors 횄쨈횄쨔
?editor_console=true:57 iso-2022-kr invalid
?editor_console=true:57 utf-16be 䓃菂ꥴ散瓃菂ꤠ쎃슠汯牳⃃菂듃菂�
?editor_console=true:57 utf-16le 썄슃璩捥썴슃₩菃ꃂ潬獲쌠슃쎴슃�
?editor_console=true:57 iso-2022-cn invalid

Why does this method not work? Is it possible to fix the string with this method, or in another way?

Upvotes: 2

Views: 788

Answers (1)

Danny Lin

Reputation: 2300

Run:

> encodeURIComponent("Détecté àlors ôùi")  // str_expected
< 'D%C3%A9tect%C3%A9%20%C3%A0lors%20%C3%B4%C3%B9i'
> escape("Détecté àlors ôùi")
< 'D%E9tect%E9%20%E0lors%20%F4%F9i'

And then:

> escape("DÃ©tectÃ© Ã lors Ã´Ã¹")  // str_actual
< 'D%C3%A9tect%C3%A9%20%C3%20lors%20%C3%B4%C3%B9'

The output of encodeURIComponent(str_expected) and the output of escape(str_actual) are almost identical. We can thus conclude that the issue arises because the UTF-8 bytes of str_expected:

D\xC3\xA9tect\xC3\xA9\x20\xC3\xA0lors\x20\xC3\xB4\xC3\xB9i

were misinterpreted as individual Unicode code points in str_actual (each byte was turned into its own UTF-16 code unit):

D\u00C3\u00A9tect\u00C3\u00A9\u0020\u00C3\u00A0lors\u0020\u00C3\u00B4\u00C3\u00B9i

instead of the expected interpretation (converting UTF-8 to UTF-16):

D\u00E9tect\u00E9\u0020\u00E0lors\u0020\u00F4\u00F9i
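The corruption described above can be reproduced in a few lines. This is only a sketch of the assumed failure mode (the sending software's real behavior is unknown), and `misreadUtf8AsBytes` is a hypothetical helper name, not something from the question:

```javascript
// Sketch: reproduce the assumed corruption by turning each UTF-8 byte
// of the correct string into its own character (one code unit per byte).
function misreadUtf8AsBytes(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the input
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

// 'Détecté àlors ôùi', written with escapes so the file encoding cannot interfere.
const expected = 'D\u00E9tect\u00E9 \u00E0lors \u00F4\u00F9i';
const broken = misreadUtf8AsBytes(expected);

// Both calls print the same percent-encoded text:
console.log(encodeURIComponent(expected));
console.log(escape(broken));
```

If the two logged lines match, the broken string carries exactly the UTF-8 bytes of the expected string, one byte per character.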

To convert str_actual, whose characters are really UTF-8 bytes, back into the desired Unicode string str_expected, run:

decodeURIComponent(escape(str_actual))
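Since escape() is deprecated, the same recovery can probably be written with TextDecoder instead. The following is a minimal sketch, assuming every character of the broken string is in the range U+0000 to U+00FF; `recoverUtf8` is a hypothetical helper name:

```javascript
// Sketch: undo "UTF-8 bytes read as one character per byte" without escape().
// Assumption: every character of the input fits in a single byte (<= 0xFF).
function recoverUtf8(strActual) {
  const bytes = new Uint8Array(strActual.length);
  for (let i = 0; i < strActual.length; i++) {
    const code = strActual.charCodeAt(i);
    if (code > 0xff) {
      throw new RangeError(`Character at index ${i} is not a single byte`);
    }
    bytes[i] = code;
  }
  // fatal: true makes malformed UTF-8 throw instead of producing U+FFFD.
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

// The broken string with the non-breaking space (U+00A0) still intact.
const strActual = 'D\u00C3\u00A9tect\u00C3\u00A9 \u00C3\u00A0lors \u00C3\u00B4\u00C3\u00B9i';
console.log(recoverUtf8(strActual)); // Détecté àlors ôùi
```

Note that both this sketch and decodeURIComponent(escape(...)) throw if the input is not valid UTF-8 once converted back to bytes, for example when a non-breaking space has been replaced by a regular space.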

Additional note: the trailing i is probably missing from str_actual simply because it was not selected when the string was copied. The difference between \xC3\xA0lors in str_expected and \u00C3\u0020lors in str_actual is probably because the non-breaking space (NBSP, \u00A0) in the original output \u00C3\u00A0lors was converted to a regular space (\u0020) when it was copied as text. To prevent such unexpected conversions, you may have to redirect the original output stream directly to a file rather than selecting and copying it manually.
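If only the copied text is available, the lost NBSP can sometimes be put back before decoding. The sketch below is a heuristic, not a general fix: it assumes that a character in the range U+00C2 to U+00DF (a two-byte UTF-8 lead byte) directly followed by a regular space was originally followed by U+00A0. `restoreNbsp` is a hypothetical helper name:

```javascript
// Heuristic sketch: a two-byte UTF-8 lead byte (0xC2..0xDF, seen here as a
// character) can never be followed by 0x20 in valid UTF-8, so assume the
// space was originally a non-breaking space (U+00A0).
function restoreNbsp(str) {
  return str.replace(/([\u00C2-\u00DF]) /g, '$1\u00A0');
}

// The string as pasted, with a regular space after \u00C3.
const pasted = 'D\u00C3\u00A9tect\u00C3\u00A9 \u00C3 lors \u00C3\u00B4\u00C3\u00B9';
const fixed = decodeURIComponent(escape(restoreNbsp(pasted)));
console.log(fixed); // Détecté àlors ôù
```

Without the restoreNbsp step, decodeURIComponent throws a URIError on the pasted string, because %C3%20 is not a valid UTF-8 sequence.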

Upvotes: 1
