adamsierakowski

Reputation: 1

Turkish İ lowercasing as two characters: is this a bug in Python?

I found out that the Turkish/Azeri LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130, İ) is the only character that gets converted into two when I use the .lower() method in Python.

my_str = "İZMİR FUßBALL CAFÉ"
print(my_str.lower())

>>> i̇zmi̇r fußball café

See the double dot in i̇? This is the regular Latin small i followed by a COMBINING DOT ABOVE (U+0307). I find this quite surprising, as I would expect the output to be just i (without the combining dot).
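
To make the two code points visible, here is a minimal check. The variable names are only for illustration, and I write İ as the escape \u0130 so the example does not depend on how the source file is encoded.

```python
import unicodedata as ud

# "\u0130" is LATIN CAPITAL LETTER I WITH DOT ABOVE (İ).
lowered = "\u0130".lower()

# Pair every code point of the lowercased result with its Unicode name.
code_points = [(f"U+{ord(ch):04X}", ud.name(ch)) for ch in lowered]
for cp, name in code_points:
    print(cp, name)
```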

This is important for me, since it causes the lengths of the original and the lowercased strings to differ. As I understand it, .casefold() is the method that may change the string length by replacing one character with several (e.g. ß > ss), whereas .lower() should leave the length intact.
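
To make that expectation concrete, here is a small comparison of the two methods on ß and İ. It is only a sketch of what I observe, with made-up variable names, not a statement of what the methods are specified to do.

```python
# Characters written as escapes: \u00df is ß, \u0130 is İ.
samples = {"sharp s": "\u00df", "I with dot above": "\u0130"}

# For each character: (original length, length after lower, length after casefold).
lengths = {
    label: (len(ch), len(ch.lower()), len(ch.casefold()))
    for label, ch in samples.items()
}

for label, triple in lengths.items():
    print(label, triple)
```

So ß only grows under .casefold(), as I expected, while İ already grows under .lower().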

You can see below that the lengths are different and, consequently, the characters don't match.

print(len(my_str))
print(len(my_str.lower()))

>>> 18
>>> 20

from itertools import zip_longest
for a, b in zip_longest(my_str, my_str.lower()):
    print(a, "   ", b)
    
>>> İ     i
>>> Z     ̇
>>> M     z
>>> İ     m
>>> R     i
>>>       ̇
>>> F     r
>>> U      
>>> ß     f
>>> B     u
>>> A     ß
>>> L     b
>>> L     a
>>>       l
>>> C     l
>>> A      
>>> F     c
>>> É     a
>>> None     f
>>> None     é

Here I checked that it is indeed the ONLY character that behaves in this weird way when lowercased (based on a script from this answer).

import sys
import unicodedata as ud

print("Unicode version:", ud.unidata_version, "\n")
total = 0
for codepoint in map(chr, range(sys.maxunicode + 1)):  # + 1 so the last code point is included
    lower = codepoint.lower()
    if len(lower) > 1:
        total += 1
        for conversion, converted in zip(
            ("orig", "lower"),
            (codepoint, lower)
        ):
            print(conversion, [ud.name(cp) for cp in converted], converted)
        print()
print("Total chars converted to more than one char when lowercased:", total)

>>> Unicode version: 13.0.0 
>>> 
>>> orig ['LATIN CAPITAL LETTER I WITH DOT ABOVE'] İ
>>> lower ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE'] i̇
>>> 
>>> Total chars converted to more than one char when lowercased: 1

I don't understand the logic behind this substitution. It looks like something implemented so that you can reconstruct an İ if there is an i̇ substring in the result of string.lower(). But when there's an a, you can't tell whether the original string had an A or an a, so why is there an exception for this character?
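
Here is a quick test of that guess. It assumes NFC normalization is an acceptable way to put the letter back together, and it only shows that the round trip is possible, not that this is the actual rationale.

```python
import unicodedata as ud

original = "\u0130"                       # İ
lowered = original.lower()                # "i" + U+0307
raised = lowered.upper()                  # "I" + U+0307 (the combining dot has no case)
recomposed = ud.normalize("NFC", raised)  # compose I + U+0307 back into one code point

print(recomposed == original)

# By contrast, a plain "i" uppercases to a plain "I", with no way to recover an İ.
plain = "i".upper()
print(plain)
```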

My original problem was the unmatching lengths of the original and lowercased strings, but since I found the culprit in my data (the letter İ), I already have a solution (either replacing i̇ with i in the target string or customizing the .lower() method). I am still curious why Python behaves like this in the first place, and that's why I'm asking the question.
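
For reference, the workaround I have in mind looks roughly like this. The helper name lower_same_length is made up for this post, and the sketch assumes that mapping İ straight to a plain i is acceptable for the data at hand; it does no other locale-specific handling.

```python
# Hypothetical helper: map U+0130 (İ) to a plain "i" before lowercasing,
# so that .lower() never has to emit the extra COMBINING DOT ABOVE.
_DOTTED_CAPITAL_I = {0x0130: "i"}

def lower_same_length(text):
    return text.translate(_DOTTED_CAPITAL_I).lower()

# Same sample string as above, written with escapes for İ, ß and É.
my_str = "\u0130ZM\u0130R FU\u00dfBALL CAF\u00c9"
result = lower_same_length(my_str)
print(result)
print(len(my_str), len(result))
```

With this, the original and the lowercased string line up character by character again.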

Upvotes: 0

Views: 81
