Reputation: 22499
I have just found this strange behaviour while parsing data from IANA.
"ǃ".isalpha() # returns True
"!".isalpha() # returns False
Apparently, the two exclamation marks are different:
In [62]: hex(ord("ǃ"))
Out[62]: '0x1c3'
In [63]: hex(ord("!"))
Out[63]: '0x21'
Is there a way to prevent this from happening? What is the origin of this behaviour?
Upvotes: 2
Views: 505
Reputation: 30153
Check the characters in the Unicode Character Database. The exclamation-like ǃ (\u01c3) is a letter:
import unicodedata
# Print each character with its code point, general category, and Unicode name
for c in "!ǃ":
    print(c, '{:04x}'.format(ord(c)), unicodedata.category(c), unicodedata.name(c))
! 0021 Po EXCLAMATION MARK
ǃ 01c3 Lo LATIN LETTER RETROFLEX CLICK
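If the goal is to stop look-alike letters such as ǃ from passing the check, one option is to require the string to be ASCII as well. This is only a minimal sketch, assuming Python 3.7+ (for str.isascii) and assuming that rejecting every non-ASCII letter is acceptable for your data; the helper name is_ascii_alpha is hypothetical.

```python
def is_ascii_alpha(s: str) -> bool:
    """Return True only if s is non-empty and every character is an ASCII letter."""
    # str.isascii() rejects U+01C3; str.isalpha() rejects '!' and the empty string.
    return s.isascii() and s.isalpha()


print(is_ascii_alpha("abc"))  # plain ASCII letters
print(is_ascii_alpha("ǃ"))    # U+01C3, a letter but not ASCII
print(is_ascii_alpha("!"))    # U+0021, ASCII but punctuation
```

Note that this also rejects legitimate non-ASCII letters (for example "é"), so it only fits input that is expected to be plain ASCII.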
Upvotes: 3
Reputation: 18426
From docs:
Return True if all characters in the string are alphabetic and there is at least one character, False otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.
This means the Unicode character you are using is defined as a letter in the Unicode character database.
>>> ord("ǃ")
451
Looking at Wikipedia's list of Unicode characters, the character ǃ (U+01C3) falls in the Latin Extended-B block, where it is classified as a letter (general category Lo), and that's why isalpha returns True.
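To see which characters in parsed data trigger this, the rule quoted from the docs can be reproduced with unicodedata. The following is a rough sketch rather than a definitive approach; the names LETTER_CATEGORIES, is_letter and flag_non_ascii_letters are hypothetical helpers introduced for illustration.

```python
import unicodedata

# General categories that the docs list as "Letter" for str.isalpha()
LETTER_CATEGORIES = {"Lm", "Lt", "Lu", "Ll", "Lo"}


def is_letter(ch: str) -> bool:
    """Return True if a single character has a Letter general category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def flag_non_ascii_letters(text: str):
    """List (char, code point, name) for each non-ASCII letter in text."""
    return [
        (ch, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127 and is_letter(ch)
    ]


for entry in flag_non_ascii_letters("a!ǃ"):
    print(entry)
```

Running this over suspicious input makes it easy to spot characters that look like punctuation but are classified as letters.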
Upvotes: 0