Sean Anderson

Reputation: 29301

Comparing two strings with special characters via charCodeAt

My goal is to create a method which takes two strings with special characters and properly compares them. I'm struggling with understanding the logistics of character encoding.

So, my string looks like:

Häzel - This Girl Is Watching Me

I have two copies of this string. One came from a third-party API via $.ajax(); the other was deserialized from a response from my own server, also fetched with $.ajax().

My original string, when represented as char codes, looks like:

Array[33]
0: 72
1: 97
2: 776
3: 122
4: 101
5: 108
6: 32
7: 45
8: 32
9: 84
10: 104
11: 105
12: 115
13: 32
14: 71
15: 105
16: 114
17: 108
18: 32
19: 73
20: 115
21: 32
22: 87
23: 97
24: 116
25: 99
26: 104
27: 105
28: 110
29: 103
30: 32
31: 77
32: 101

and afterwards:

Array[32]
0: 72
1: 228
2: 122
3: 101
4: 108
5: 32
6: 45
7: 32
8: 84
9: 104
10: 105
11: 115
12: 32
13: 71
14: 105
15: 114
16: 108
17: 32
18: 73
19: 115
20: 32
21: 87
22: 97
23: 116
24: 99
25: 104
26: 105
27: 110
28: 103
29: 32
30: 77
31: 101

with the difference being the "ä" is represented as [97, 776] before serialization and [228] after serialization.
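For reference, the two listings above can be reproduced with a small sketch like the following. The helper name `charCodes` is made up for illustration, and the strings are shortened to the first word and written with explicit escapes so the two forms stay distinguishable in source code.

```javascript
// Hypothetical helper (not part of any library): list the UTF-16 code
// units of a string, which is what charCodeAt returns for each index.
function charCodes(str) {
  const codes = [];
  for (let i = 0; i < str.length; i++) {
    codes.push(str.charCodeAt(i));
  }
  return codes;
}

// "a" followed by U+0308 (combining diaeresis), as in the first listing.
const decomposed = "Ha\u0308zel";
// The single code unit U+00E4, as in the second listing.
const precomposed = "H\u00E4zel";

console.log(charCodes(decomposed));      // [72, 97, 776, 122, 101, 108]
console.log(charCodes(precomposed));     // [72, 228, 122, 101, 108]
console.log(decomposed === precomposed); // false
```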

I'm wondering a few things:

In my mind they are exactly the same. I have no preference on encoding at this point in time -- I only wish for the two strings to be equatable.

Upvotes: 2

Views: 2187

Answers (1)

Shi

Reputation: 4258

At the lowest level, a string is a sequence of bytes, and on its own it is nothing more than a bit pattern. To interpret it you first need a character set, which maps numbers to characters, for example 65 to A, 97 to a and 228 to ä. You then need a character encoding, which maps each of those numbers to a bit pattern.

For number 228, a common 8-bit encoding such as ISO-8859-1 (Latin-1) simply uses 0xE4 as the bit pattern. UTF-8 uses 0xC3 0xA4, and UTF-16 uses 0x00 0xE4 (in big-endian byte order).
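The byte patterns for 228 can be checked with a short sketch. It assumes an environment where `TextEncoder` is available as a global (modern browsers and current Node.js); `utf16beBytes` and `toHex` are hypothetical helpers written here for illustration, since JavaScript has no built-in UTF-16 byte encoder.

```javascript
const ch = "\u00E4"; // code point 228, "ä"

// TextEncoder always produces UTF-8.
const utf8Bytes = Array.from(new TextEncoder().encode(ch));

// Hypothetical helper: split each UTF-16 code unit into two bytes,
// high byte first (big-endian).
function utf16beBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    bytes.push(unit >> 8, unit & 0xff);
  }
  return bytes;
}

// Hypothetical helper: format bytes as 0xNN strings for display.
const toHex = (bytes) =>
  bytes.map((b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0"));

console.log(toHex(utf8Bytes));        // ["0xC3", "0xA4"]
console.log(toHex(utf16beBytes(ch))); // ["0x00", "0xE4"]
```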

So in order to properly compare a string, you need to know its bit pattern (byte sequence), its encoding and its character set. If any of these is missing, the strings cannot be properly compared.

Nowadays, Unicode is used as the character set most of the time. If you only use the very basic characters, ASCII will do the job as well. ASCII is a subset of Unicode, as the first 128 code points (0 to 127) are the same. For those code points, the 7-bit ASCII encoding is identical to UTF-8.

So in short, without knowing character set and character encoding (or at least knowing that they are the same for both strings), you cannot compare strings at all.
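When both values are already decoded JavaScript strings, the character set and encoding are the same on both sides, and the remaining difference in the question (97, 776 versus 228) is one of Unicode normalization form. That technique is not mentioned above; the following is only a sketch of one possible approach, assuming an engine that supports `String.prototype.normalize` (ES2015 or later). The function name `equalNormalized` is hypothetical.

```javascript
// Hypothetical helper: compare two strings after converting both to
// the same Unicode normalization form (NFC, the composed form).
function equalNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Decomposed form: "a" followed by U+0308 (33 code units in total).
const fromApi = "Ha\u0308zel - This Girl Is Watching Me";
// Precomposed form: U+00E4 (32 code units in total).
const fromServer = "H\u00E4zel - This Girl Is Watching Me";

console.log(fromApi === fromServer);                // false
console.log(equalNormalized(fromApi, fromServer));  // true
console.log(fromApi.normalize("NFC").length);       // 32
```

NFC is chosen here only because it matches the shorter, precomposed representation; normalizing both sides to NFD would make them equal as well.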

Upvotes: 2
