I found out that the Turkish/Azeri LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130, İ) is the only character that gets converted into two characters when I use the .lower() method in Python.
my_str = "İZMİR FUßBALL CAFÉ"
print(my_str.lower())
>>> i̇zmi̇r fußball café
See the double dot in i̇? This is the regular Latin small i followed by a COMBINING DOT ABOVE (U+0307). I find it quite surprising, as I would expect the output to be just i (without the combining dot).
This is important for me, since it causes the lengths of the original and the lowercased strings to differ. The way I understand it, it should be the .casefold() method that can change the string length by replacing one character with several (e.g. ß → ss), while .lower() should leave the length intact.
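To make that expectation concrete, here is a minimal sketch that compares the length of a single character before and after .lower() and .casefold(). It uses only built-in str methods; the helper name length_change is hypothetical and exists only for this illustration.

```python
def length_change(ch: str) -> dict:
    """Return the lengths of ch, ch.lower() and ch.casefold()."""
    return {
        "original": len(ch),
        "lower": len(ch.lower()),
        "casefold": len(ch.casefold()),
    }

# U+00DF (ß): only casefold() expands it (to "ss").
print(length_change("\u00df"))
# U+0130 (İ): lower() already expands it to "i" + U+0307.
print(length_change("\u0130"))
```

So for ß only .casefold() changes the length, as I expected, but for İ the length already changes with .lower().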
You can see below that the lengths are different and, consequently, the characters don't match.
print(len(my_str))
print(len(my_str.lower()))
>>> 18
>>> 20
from itertools import zip_longest
for a, b in zip_longest(my_str, my_str.lower()):
    print(a, " ", b)
>>> İ i
>>> Z ̇
>>> M z
>>> İ m
>>> R i
>>> ̇
>>> F r
>>> U
>>> ß f
>>> B u
>>> A ß
>>> L b
>>> L a
>>> l
>>> C l
>>> A
>>> F c
>>> É a
>>> None f
>>> None é
Here I checked that it is indeed the ONLY character that behaves in this weird way when lowercased (based on a script from this answer).
import sys
import unicodedata as ud
print("Unicode version:", ud.unidata_version, "\n")
total = 0
# range() excludes its end value, so add 1 to include sys.maxunicode itself
for codepoint in map(chr, range(sys.maxunicode + 1)):
    lower = codepoint.lower()
    if len(lower) > 1:
        total += 1
        for conversion, converted in zip(
            ("orig", "lower"),
            (codepoint, lower)
        ):
            print(conversion, [ud.name(cp) for cp in converted], converted)
        print()
print("Total chars converted to more than one char when lowercased:", total)
>>> Unicode version: 13.0.0
>>>
>>> orig ['LATIN CAPITAL LETTER I WITH DOT ABOVE'] İ
>>> lower ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE'] i̇
>>>
>>> Total chars converted to more than one char when lowercased: 1
I don't understand the logic behind this substitution. This seems like something implemented so you can reconstruct an İ if there is an i̇ substring in a string.lower(). But when there's an a, you can't tell if the original string had an A or an a, so why is there an exception for this character?
My original problem was the mismatched lengths of the original and lowercased strings, but since I found the culprit in my data (the letter İ), I already have a solution (either replacing i̇ with i in the target string or customizing the .lower() method). I am still curious why Python behaves this way in the first place, and that's why I'm asking the question.
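For reference, the first workaround could look roughly like the sketch below. It assumes that dropping the combining dot is acceptable for the data at hand and that the input does not already contain a lowercase i followed by U+0307 (such a sequence would be collapsed too). The name lower_same_length is a hypothetical helper, not a built-in.

```python
COMBINING_DOT_ABOVE = "\u0307"


def lower_same_length(text: str) -> str:
    """Lowercase text, collapsing "i" + U+0307 back to a plain "i"."""
    return text.lower().replace("i" + COMBINING_DOT_ABOVE, "i")


my_str = "\u0130ZM\u0130R FU\u00dfBALL CAF\u00c9"  # "İZMİR FUßBALL CAFÉ"
lowered = lower_same_length(my_str)
print(lowered)
print(len(my_str) == len(lowered))
```

With this the two strings line up character by character again, at the cost of losing the distinction between a lowercased İ and a lowercased I.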