Why do some strings look the same but are deemed non-identical when checked for string equivalence?

Question

I'm working on an assignment question that asks why two strings that look identical can compare as non-identical.

The question is stated as below:

In a computer program’s code, two string variables are declared. When their respective values are printed by the program onto the computer screen, both appear as the string "ĝ" . However, the program returns false when both variables are checked for their string equivalence (i.e. false means both strings are considered non-identical).

What could be the most likely cause of these seemingly contradictory results? Assume that the UTF-8 encoding is used by the computer program.

The question expects me to explain why this contradictory result occurs and how UTF-8 encoding works in this scenario.

My current guess is that there is another character that looks similar to "ĝ" but has a different Unicode representation, but I'm not entirely sure about it.

Mark Tolonen · Accepted Answer

Unicode has combining characters, so you could have:

U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX

or:

U+0067 LATIN SMALL LETTER G
U+0302 COMBINING CIRCUMFLEX ACCENT

Visually these will print the same (Python code example):

>>> print('\u011d \u0067\u0302')
ĝ ĝ
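To tie this back to the question, here is a minimal sketch of the comparison itself. The variable names are my own and only illustrative; the two strings are the same two forms listed above.

```python
# Two strings that render as the same glyph but hold different code points.
precomposed = '\u011d'        # U+011D, a single code point
decomposed = '\u0067\u0302'   # U+0067 followed by U+0302 (combining accent)

# Equality compares code points, not rendered glyphs.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2
```

The length check is a quick way to spot this situation: the precomposed form is one code point, the decomposed form is two.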

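FYI, in UTF-8 encoding, the first form is the hexadecimal bytes C4 9D and the second is 67 CC 82. The two strings hold different code points (and therefore different bytes), so an equality check returns false even though they render identically.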
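As a hedged follow-up, the byte values can be checked directly, and one common way to make such strings compare equal is Unicode normalization. The sketch below uses Python's standard `unicodedata` module; normalization is my suggestion and is not part of what the assignment question asks. The variable names are illustrative, and `bytes.hex` with a separator assumes Python 3.8 or later.

```python
import unicodedata

precomposed = '\u011d'        # LATIN SMALL LETTER G WITH CIRCUMFLEX
decomposed = '\u0067\u0302'   # g + COMBINING CIRCUMFLEX ACCENT

# Inspect the UTF-8 bytes of each form.
print(precomposed.encode('utf-8').hex(' '))  # c4 9d
print(decomposed.encode('utf-8').hex(' '))   # 67 cc 82

# NFC composes base letter + combining mark into the precomposed form,
# so the normalized strings compare equal.
nfc_equal = unicodedata.normalize('NFC', decomposed) == precomposed
print(nfc_equal)  # True
```

Normalizing both sides to the same form (NFC or NFD) before comparing is the usual approach when strings may come from different sources.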
