user7711283

Python3 surprising behavior of identifier being a non-ASCII Unicode character

The following code runs without an assertion error:

K = 'K'
𝕂 = '𝕂'
𝚱 = '𝚱'
π”Ž = 'π”Ž'
𝕢 = '𝕢'
π“š = 'π“š'
α΄· = 'α΄·'
assert K == 𝕂 == π”Ž == 𝕢 == π“š == α΄·
print(f'{K=}, {𝕂=}, {𝚱=}, {𝕢=}, {π”Ž=}, {π“š=}')

and prints

K='ᴷ', 𝕂='ᴷ', 𝚱='𝚱', 𝕶='ᴷ', 𝔎='ᴷ', 𝓚='ᴷ'

I am aware of https://peps.python.org/pep-3131/ and I have read the Python documentation on identifiers (https://docs.python.org/3/reference/lexical_analysis.html#identifiers), but I haven't found anything that explains the behavior I observed.

So my question is: what is wrong with my expectation that the values of all the other, visually distinct identifiers don't change when a new value is assigned to one of them?

UPDATE: Taking the comments and answers available so far into account, I need to explain more about what I would consider a satisfying answer to my question:

The hint about the NFKC conversion behind the comparison of identifier names helps me understand how the observed behavior arises, but it still leaves open the question of what the deeper reason is for choosing different approaches to comparing Unicode strings depending on the context in which they occur.

The way strings are compared with each other as string literals apparently differs from the way the same strings are compared when they specify identifier names.

What am I still missing that would let me see the deeper reason why it was decided that Unicode strings representing identifier names in Python are not compared with each other the same way as Unicode strings representing string literals?

If I understand it correctly, Unicode makes it possible to specify the same expected outcome ambiguously, using either one code point representing a complex character or multiple code points consisting of a suitable base character plus its modifiers. Normalization of a Unicode string is then an attempt to resolve the mess caused by introducing this ambiguity in the first place. But that is Unicode-specific business which, in my eyes, mainly affects Unicode visualization tools such as viewers and editors. What a programming language that represents a string as a list of integer values (Unicode code points) larger than 255 actually implements is another matter, isn't it?
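To make that ambiguity concrete (a small illustrative sketch using 'é' instead of the K variants above): the precomposed and the decomposed spelling render the same, but as Python strings they are different sequences of code points, and only normalization makes them compare equal:

from unicodedata import normalize as normal

composed   = '\u00e9'   # 'é' as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT

print(composed, decomposed)                  # both render as é
print(len(composed), len(decomposed))        # 1 2
print([f'{ord(c):X}' for c in composed])     # ['E9']
print([f'{ord(c):X}' for c in decomposed])   # ['65', '301']
print(composed == decomposed)                                  # False: plain code-point comparison
print(normal('NFKC', composed) == normal('NFKC', decomposed))  # True after normalization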

Below are some further attempts to find a better wording for the question I am trying to get answered:

What is the advantage of making it possible for two different Unicode strings to be considered not different when they are used as names of Python identifiers?

What is the actual feature behind behavior that, to me, doesn't make sense because it breaks the WYSIWYG principle?

Below is some more code illustrating what is going on and demonstrating the difference between comparing string literals and comparing identifier names that originate from the same strings as those literals:

from unicodedata import normalize as normal
itisasitisRepr = [                char       for char in ['K', '𝕂', '𝚱', '𝔎', '𝕶', '𝓚', 'ᴷ']]
hexintasisRepr = [         f'{ord(char):5X}' for char in itisasitisRepr]
normalizedRepr = [ normal('NFKC', char)      for char in itisasitisRepr]
hexintnormRepr = [         f'{ord(char):5X}' for char in normalizedRepr]
print(itisasitisRepr)
print(hexintasisRepr)
print(normalizedRepr)
print(hexintnormRepr)
print(f"{              'K' ==              '𝕂'  = }")
print(f"{normal('NFKC','K')==normal('NFKC','𝕂') = }")
print(ᴷ == 𝓚, 'ᴷ' == '𝓚') # gives: True False

gives:

['K', '𝕂', '𝚱', '𝔎', '𝕶', '𝓚', 'ᴷ']
['   4B', '1D542', '1D6B1', '1D50E', '1D576', '1D4DA', ' 1D37']
['K', 'K', 'Κ', 'K', 'K', 'K', 'K']
['   4B', '   4B', '  39A', '   4B', '   4B', '   4B', '   4B']
              'K' ==              '𝕂'  = False
normal('NFKC','K')==normal('NFKC','𝕂') = True

Upvotes: 4

Views: 148

Answers (1)

paxdiablo

Reputation: 881643

Python identifiers with non-ASCII characters are subject to NFKC normalisation(1); you can see the effect in the following code:

import unicodedata
for char in ['K', '𝕂', '𝚱', '𝔎', '𝕶', '𝓚', 'ᴷ']:
    normalised_char = unicodedata.normalize('NFKC', char)
    print(char, normalised_char, ord(normalised_char))

The output of that is:

K K 75
𝕂 K 75
𝚱 Κ 922
π”Ž K 75
𝕢 K 75
π“š K 75
α΄· K 75

This shows that all but one of those are the same identifier, which is why your assert passes (it leaves out the one different identifier) and why most of them appear to have the same value. It's really no different from the following code, in which it is hopefully immediately clear what will happen:

a = '1'
a = '2'
b = '3'
a = '4'
a = '5'
a = '6'
a = '7'
assert a == a == a == a == a == a             # passes
print(f'{a=}, {a=}, {b=}, {a=}, {a=}, {a=}')  # a='7', a='7', b='3', a='7', a='7', a='7'
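You can also check the namespace directly to see that it really is one identifier; a quick sketch (assuming it is run as module-level code so the binding shows up in globals()):

import unicodedata

𝕂 = 42                    # the double-struck K from the question
print('K' in globals())   # True: the identifier was normalised to plain 'K' while parsing
print('𝕂' in globals())   # False: string literals are not normalised
print(globals()['K'])     # 42
print(unicodedata.normalize('NFKC', '𝕂') in globals())  # True: normalising the literal finds it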

In response to your update, specifically the text:

What is the advantage of making it possible for two different Unicode strings to be considered not different when they are used as names of Python identifiers?

My own particular viewpoint as a developer is that I want to be able to look at code and understand it. That's not going to be easy when different code points map to similar or even identical graphemes(2), such as with:

Ω = 1
Ω = 2
Ω = Ω + Ω
print(Ω * Ω)

What would you expect from that code? You set omega to one, then two. You then double it to four, and print the square which is sixteen. Easy, right?

And, in actual fact, that's what you do get in Python, despite the fact that there are both omega and ohm characters in that code, and that's because they normalise to the same identifier. Were they not normalised, you would instead have the equivalent of:

omega = 1
ohm = 2
ohm = omega + ohm
print(ohm * ohm)

And this would output nine rather than sixteen. Best of luck debugging that when you can't see a difference between the omega and ohm identifiers :-)
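If you suspect that kind of trickery, one way to make it visible is to look at the raw tokens of the source, since the tokenize module still sees the original code points; a rough sketch, with the two look-alike code points written as escapes so they are explicit:

import io
import tokenize
import unicodedata

source = '\u03a9 = 1\n\u2126 = 2\n\u2126 = \u03a9 + \u2126\nprint(\u2126 * \u2126)\n'

# Report every non-ASCII character appearing in an identifier (NAME) token.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.NAME:
        for ch in tok.string:
            if not ch.isascii():
                print(f'line {tok.start[0]}: U+{ord(ch):04X} {unicodedata.name(ch)}')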

There are also diacritics that can have different representations, such as ḋ:

  • U+1E0B (Latin Small Letter D with Dot Above).
  • U+0064, U+0307 (Latin Small Letter D, Combining Dot Above).

And this may get even more complex when a base letter can have multiple diacritics, such as ậ, ç̇, or ė́. The order of combining marks may be arbitrary, meaning that there could be many ways of representing the ậç̇ė́ variable (two by two by two gives eight, but there are potentially more, since distinct code points also exist for "half-accented" characters like ç).
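A short sketch of that point (the character is just an example): the precomposed ậ and two decomposed spellings that differ only in the order of the combining marks all end up as the same single code point after normalisation:

import unicodedata

forms = [
    '\u1ead',            # ậ as one code point
    'a\u0323\u0302',     # a + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT
    'a\u0302\u0323',     # a + COMBINING CIRCUMFLEX ACCENT + COMBINING DOT BELOW
]
for form in forms:
    normalised = unicodedata.normalize('NFKC', form)
    print([f'U+{ord(c):04X}' for c in form], '->', [f'U+{ord(c):04X}' for c in normalised])
# All three lines end with ['U+1EAD']: a single code point.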

No, I think I very much appreciate the normalisation that happens to Python identifiers :-)


(1) From the Python docs about identifiers:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.


(2) You can think of graphemes as the basic unit of writing (like a letter), similar to phonemes being the basic unit of speech (like a sound). So the English grapheme c has at least two phonemes, the hard-c in cook and the soft-c in ice.

And, making matters even more complex, cook shows that there is one phoneme (hard-c) giving two separate graphemes, c and k.

Now think how much more complex it gets when you introduce every other language on the planet; I'm surprised the members of the Unicode Consortium don't go absolutely insane :-)

Upvotes: 9
