Reputation: 1513
I have received a string with a broken encoding from another piece of software. I would like to fix its encoding in JavaScript, but I feel I am missing something.
Here's an example of a broken string: DÃ©tectÃ© Ã lors Ã´Ã¹
And the expected output would be: Détecté àlors ôùi
I don't know the encoding used to send me the string.
My idea is to use the TextDecoder API: convert the string to bytes, and then re-encode it as UTF-8 or UTF-16.
Here's the piece of code I used to detect the charset used:
const str = 'Détecté àlors ôùi';
const str2 = 'DÃ©tectÃ© Ã lors Ã´Ã¹';
const charsets = [
'utf-8',
"ibm866",
"iso-8859-2",
"iso-8859-3",
"iso-8859-4",
"iso-8859-5",
"iso-8859-6",
"iso-8859-7",
"iso-8859-8",
"iso-8859-8-i",
"iso-8859-10",
"iso-8859-13",
"iso-8859-14",
"iso-8859-15",
"iso-8859-16",
"koi8-r",
"koi8-u",
"macintosh",
"windows-874",
"windows-1250",
"windows-1251",
"windows-1252",
"windows-1253",
"windows-1254",
"windows-1255",
"windows-1256",
"windows-1257",
"windows-1258",
"x-mac-cyrillic",
"gbk",
"gb18030",
"hz-gb-2312",
"big5",
"euc-jp",
"iso-2022-jp",
"shift-jis",
"euc-kr",
"iso-2022-kr",
"utf-16be",
"utf-16le",
"iso-2022-cn"
];
const encoder = new TextEncoder();
// UTF-8 bytes of the broken string
const view = encoder.encode(str2);
console.log('__________________');
charsets.forEach((charset) => {
  try {
    // fatal and ignoreBOM are constructor options, not decode() options
    const decoder = new TextDecoder(charset, {
      fatal: false,
      ignoreBOM: true,
    });
    const fixedStr = decoder.decode(view);
    console.log(charset, fixedStr);
  } catch (e) {
    // The constructor throws a RangeError for unsupported labels
    console.log(charset, 'invalid');
  }
});
(the code can be tested here: https://jsfiddle.net/tashebwj/ )
The output is the following:
__________________
?editor_console=true:57 utf-8 DÃ©tectÃ© Ã lors Ã´Ã¹
?editor_console=true:57 ibm866 D├Г┬йtect├Г┬й ├Г┬аlors ├Г┬┤├Г┬╣
?editor_console=true:57 iso-8859-2 DĂŠtectĂŠ Ă lors Ă´Ăš
?editor_console=true:57 iso-8859-3 D�Âİtect�Âİ � lors �´�Âı
?editor_console=true:57 iso-8859-4 DÊtectÊ àlors ôÚ
?editor_console=true:57 iso-8859-5 DУТЉtectУТЉ УТ lors УТДУТЙ
?editor_console=true:57 iso-8859-6 Dأآ�tectأآ� أآ lors أآ�أآ�
?editor_console=true:57 iso-8859-7 DΓΒ©tectΓΒ© ΓΒ lors ΓΒ΄ΓΒΉ
?editor_console=true:57 iso-8859-8 D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-8-i D��©tect��© �� lors ��´��¹
?editor_console=true:57 iso-8859-10 DÃÂĐtectÃÂĐ ÃÂ lors ÃÂīÃÂđ
?editor_console=true:57 iso-8859-13 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ“ĆĀ¹
?editor_console=true:57 iso-8859-14 Détecté àlors ÃÂṀÃÂṗ
?editor_console=true:57 iso-8859-15 Détecté àlors ÃŽù
?editor_console=true:57 iso-8859-16 DĂ©tectĂ© Ă lors ĂÂŽĂÂč
?editor_console=true:57 koi8-r Dц┐б╘tectц┐б╘ ц┐б═lors ц┐б╢ц┐б╧
?editor_console=true:57 koi8-u Dц┐б╘tectц┐б╘ ц┐б═lors ц┐бЄц┐б╧
?editor_console=true:57 macintosh Détecté àlors ôù
?editor_console=true:57 windows-874 Dรยฉtectรยฉ รย lors รยดรยน
?editor_console=true:57 windows-1250 DĂ©tectĂ© Ă lors Ă´ĂÂą
?editor_console=true:57 windows-1251 DГѓВ©tectГѓВ© ГѓВ lors ГѓВґГѓВ№
?editor_console=true:57 windows-1252 Détecté àlors ôù
?editor_console=true:57 windows-1253 Détecté àlors ôù
?editor_console=true:57 windows-1254 Détecté àlors ôù
?editor_console=true:57 windows-1255 Dֳƒֲ©tectֳƒֲ© ֳƒֲ lors ֳƒֲ´ֳƒֲ¹
?editor_console=true:57 windows-1256 Dأƒآ©tectأƒآ© أƒآ lors أƒآ´أƒآ¹
?editor_console=true:57 windows-1257 DĆĀ©tectĆĀ© ĆĀ lors ĆĀ´ĆĀ¹
?editor_console=true:57 windows-1258 DĂƒÂ©tectĂƒÂ© ĂƒÂ lors ĂƒÂ´ĂƒÂ¹
?editor_console=true:57 x-mac-cyrillic D√Г¬©tect√Г¬© √Г¬†lors √Г¬і√Г¬є
?editor_console=true:57 gbk D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 gb18030 D脙漏tect脙漏 脙聽lors 脙麓脙鹿
?editor_console=true:57 hz-gb-2312 invalid
?editor_console=true:57 big5 D�穢tect�穢 ��饊ors �織�繒
?editor_console=true:57 euc-jp D�息tect�息 ��lors �卒�孫
?editor_console=true:57 iso-2022-jp D����tect���� ����lors ��������
?editor_console=true:57 shift-jis Dテδゥtectテδゥ テδ�lors テδエテδケ
?editor_console=true:57 euc-kr D횄짤tect횄짤 횄혻lors 횄쨈횄쨔
?editor_console=true:57 iso-2022-kr invalid
?editor_console=true:57 utf-16be 䓃菂ꥴ散瓃菂ꤠ쎃슠汯牳菂듃菂�
?editor_console=true:57 utf-16le 썄슃璩捥썴슃₩菃ꃂ潬獲쌠슃쎴슃�
?editor_console=true:57 iso-2022-cn invalid
Why does this method not work? Is it possible to fix the string with this method, or in another way?
Upvotes: 2
Views: 788
Reputation: 2300
Run:
> encodeURIComponent("Détecté àlors ôùi") // str_expected
< 'D%C3%A9tect%C3%A9%20%C3%A0lors%20%C3%B4%C3%B9i'
> escape("Détecté àlors ôùi")
< 'D%E9tect%E9%20%E0lors%20%F4%F9i'
And then:
> escape("DÃ©tectÃ© Ã lors Ã´Ã¹") // str_actual
< 'D%C3%A9tect%C3%A9%20%C3%20lors%20%C3%B4%C3%B9'
We can see that the escaped str_actual is almost identical to the URI-encoded str_expected, and can thus conclude that the issue arises because the UTF-8 bytes of str_expected:
D\xC3\xA9tect\xC3\xA9\x20\xC3\xA0lors\x20\xC3\xB4\xC3\xB9i
are misinterpreted as Unicode code points in str_actual (each byte is converted into one UTF-16 code unit):
D\u00C3\u00A9tect\u00C3\u00A9\u0020\u00C3\u00A0lors\u0020\u00C3\u00B4\u00C3\u00B9i
instead of the expected interpretation (converting UTF-8 to UTF-16):
D\u00E9tect\u00E9\u0020\u00E0lors\u0020\u00F4\u00F9i
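The misinterpretation described above can be reproduced in a few lines. This is only a sketch under the assumption that the sending software read each UTF-8 byte as one character; the variable names are made up for illustration.

```javascript
// The expected string, written with escapes so the source file encoding does not matter.
const expected = 'D\u00E9tect\u00E9 \u00E0lors \u00F4\u00F9i'; // "Détecté àlors ôùi"

// Step 1: the correct UTF-8 bytes of the expected string.
const utf8Bytes = new TextEncoder().encode(expected);

// Step 2: the mistake. Each byte 0x00-0xFF is turned into the code point
// U+0000-U+00FF instead of being decoded as UTF-8.
const garbled = Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join('');

console.log(garbled);
console.log(escape(garbled));
```

If the assumption holds, escape(garbled) should match the encodeURIComponent output shown at the top of this answer.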
To recover the desired Unicode string str_expected from the UTF-8 byte string str_actual, run:
decodeURIComponent(escape(str_actual))
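Since escape is deprecated, the same recovery can probably be done with TextDecoder. The attempt in the question likely fails because TextEncoder always produces UTF-8, so it encodes the already-garbled characters again instead of giving back the original bytes. The sketch below assumes every character of the broken string is in the range U+0000 to U+00FF; fixMojibake is a hypothetical helper name, not an existing API.

```javascript
// Hypothetical helper: undo "UTF-8 bytes read one byte per character".
function fixMojibake(garbled) {
  const bytes = new Uint8Array(garbled.length);
  for (let i = 0; i < garbled.length; i++) {
    const code = garbled.charCodeAt(i);
    if (code > 0xff) {
      // Assumption violated: this character cannot stand for a single byte.
      throw new RangeError(`Character at index ${i} is not a single byte`);
    }
    bytes[i] = code; // take the code point itself as the byte value
  }
  // fatal: true makes decode() throw a TypeError on malformed UTF-8
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

// str_actual with the NBSP and the trailing "i" intact
const strActual = 'D\u00C3\u00A9tect\u00C3\u00A9 \u00C3\u00A0lors \u00C3\u00B4\u00C3\u00B9i';
const recovered = fixMojibake(strActual);
console.log(recovered);
```

This variant fails loudly when the input is not the kind of broken string assumed here, rather than returning replacement characters.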
Additional note: the missing trailing i in str_actual is probably just because you did not select it when copying. And the change of \xC3\xA0lors in str_expected into \u00C3\u0020lors in str_actual is probably because the non-breaking space (NBSP, \u00A0) in the original output \u00C3\u00A0lors was converted to a regular space (\u0020) when copied as text. To prevent such unexpected conversions, you may have to redirect the original output stream directly to a file rather than manually selecting and copying it.
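If only the copied text is available, the lost NBSP has to be put back before decoding. The sketch below relies on a heuristic that is purely an assumption: a regular space directly after \u00C3 is taken to have been an NBSP originally. It would corrupt a string in which such a space is legitimate.

```javascript
// str_actual exactly as pasted in the question (NBSP already turned into a space)
const copied = 'D\u00C3\u00A9tect\u00C3\u00A9 \u00C3 lors \u00C3\u00B4\u00C3\u00B9';

// Decoding the copied text directly fails: "%C3%20" is not a valid UTF-8 sequence.
let failed = false;
try {
  decodeURIComponent(escape(copied));
} catch (e) {
  failed = e instanceof URIError;
}
console.log(failed);

// Assumption: a space right after U+00C3 used to be an NBSP (byte 0xA0).
const restored = copied.replace(/\u00C3 /g, '\u00C3\u00A0');
const fixed = decodeURIComponent(escape(restored));
console.log(fixed);
```

Capturing the original output without a copy and paste step remains the more reliable option.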
Upvotes: 1