Marcel

Reputation: 15732

JavaScript: compare two strings that actually have different encodings

I am comparing two file names in JavaScript that are presumably encoded differently, hoping to find matches.

Analysis

When I compare the log output in the JavaScript console, these file names look exactly identical:

15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3
15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3

Note the German umlauts.

Now, when I just copy and paste these strings into Notepad++ and enable the hex editor, it looks like this:

[Image: the above text in hex]

Question

How can I safely compare those two strings? Is there a general "unencode" method in JavaScript that can handle these cases? Or must I guess each encoding and then compare explicitly?

Note

Upvotes: 2

Views: 2068

Answers (1)

AKX

Reputation: 169042

What's happening here?

If you have a String in JavaScript, it is no longer raw bytes: it is a sequence of UTF-16 code units representing Unicode code points. Some component has already decoded the bytes representing those strings from the ZIP or the plist.

That is, this question is not quite about encodings, but about Unicode decomposition and normalization forms.

Unicode can represent an ä in (at least) two different ways (the examples below are in Python because its output is convenient to read): either as a single precomposed code point, taking two bytes in UTF-8,

>>> import unicodedata
>>> "ä".encode("UTF-8")
b'\xc3\xa4'  # two bytes
>>> [ord(c) for c in "ä"]
[228]
>>> [unicodedata.name(c) for c in "ä"]
['LATIN SMALL LETTER A WITH DIAERESIS']

or as a decomposed sequence (produced here with the NFKD normalization form; NFD gives the same result for this character), taking two code points and three bytes in UTF-8.

>>> unicodedata.normalize("NFKD", "ä").encode("UTF-8")
b'a\xcc\x88'  # three bytes
>>> [ord(c) for c in unicodedata.normalize("NFKD", "ä")]
[97, 776]  # two codepoints
>>> [unicodedata.name(c) for c in unicodedata.normalize("NFKD", "ä")]
['LATIN SMALL LETTER A', 'COMBINING DIAERESIS']
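
The same difference can be inspected from JavaScript itself. The following is only a minimal sketch: codePoints is a hypothetical helper name, and the two strings are written with \u escapes so that the editor or file encoding cannot silently change which form is used.

```javascript
// Two spellings of "ä", written with escapes so the source file's own
// encoding or normalization cannot blur the difference.
const composed = "\u00e4";     // LATIN SMALL LETTER A WITH DIAERESIS
const decomposed = "a\u0308";  // LATIN SMALL LETTER A + COMBINING DIAERESIS

// Hypothetical helper: list the code points of a string as hex strings.
function codePoints(s) {
  return Array.from(s, (ch) => ch.codePointAt(0).toString(16));
}

console.log(codePoints(composed));    // [ 'e4' ]
console.log(codePoints(decomposed));  // [ '61', '308' ]
console.log(composed === decomposed); // false

// Converting either string to the other's form makes them compare equal.
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true
```

Both strings render as the same glyph, which is why the console output looks identical even though === reports a difference.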

Answer

Long story short, in JavaScript, you'll need to call String#normalize() on both strings so that they are in the same normalization form (NFC by default) before attempting a regular comparison.

$ node
Welcome to Node.js v16.6.1.
Type ".help" for more information.
> var a = '15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3';
undefined
> var b = '15 - Beschänkt und gsägnet - PLAYBACKVERSION.mp3';
undefined
> a.length
50
> b.length
48
> a === b
false
> a.normalize() === b.normalize()
true
>
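
If the goal is to find matches between two lists of file names, the normalized form could serve as a lookup key. This is a sketch under assumptions, not a definitive implementation: sameName and matchNames are hypothetical helper names, NFC is chosen arbitrarily (any single form works as long as both sides use it), and the file names are spelled with \u escapes to pin down which form each one uses.

```javascript
// Hypothetical helper: true if two names differ at most in normalization form.
function sameName(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Hypothetical helper: pair each name in listA with a name in listB that
// has the same NFC form. Original spellings are kept in the result.
function matchNames(listA, listB) {
  const byKey = new Map(listB.map((name) => [name.normalize("NFC"), name]));
  const matches = [];
  for (const name of listA) {
    const other = byKey.get(name.normalize("NFC"));
    if (other !== undefined) matches.push([name, other]);
  }
  return matches;
}

// Decomposed (a + U+0308) versus precomposed (U+00E4) spelling.
const a = "15 - Bescha\u0308nkt und gsa\u0308gnet - PLAYBACKVERSION.mp3";
const b = "15 - Besch\u00e4nkt und gs\u00e4gnet - PLAYBACKVERSION.mp3";

console.log(a.length, b.length);            // 50 48
console.log(a === b);                       // false
console.log(sameName(a, b));                // true
console.log(matchNames([a], [b]).length);   // 1
```

Keeping the original spelling alongside the normalized key matters if the names are later used to open files, since the file system or archive may only recognize the form it stored.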

Upvotes: 4
