Unknown character for Turkish character

I have a dataframe consisting of two columns: (1) Turkish cities, (2) corresponding values.

dict_ = {'City': {0: 'ADANA',
  1: 'ANKARA',
  2: 'ANTALYA',
  3: 'AYDIN',
  4: 'BALIKESİR',
  5: 'BURSA',
  6: 'DENİZLİ',
  7: 'DÜZCE',
  8: 'DİYARBAKIR',
  9: 'ELAZIĞ',
  10: 'GAZİANTEP',
  11: 'GİRESUN',
  12: 'HATAY',
  13: 'KAHRAMANMARAŞ',
  14: 'KARABÜK',
  15: 'KARS',
  16: 'KAYSERİ',
  17: 'KIRIKKALE',
  18: 'KIRKLARELİ',
  19: 'KIRŞEHİR',
  20: 'KOCAELİ',
  21: 'KONYA',
  22: 'KÜTAHYA',
  23: 'MANİSA',
  24: 'MARDİN',
  25: 'MERSİN',
  26: 'MUĞLA',
  27: 'ORDU',
  28: 'OSMANİYE',
  29: 'SAKARYA',
  30: 'SAMSUN',
  31: 'TRABZON',
  32: 'UŞAK',
  33: 'YALOVA',
  34: 'ZONGULDAK',
  35: 'ÇORUM',
  36: 'İSTANBUL',
  37: 'İZMİR'},
 'Value': {0: 15,
  1: 25,
  2: 19,
  3: 2,
  4: 6,
  5: 5,
  6: 3,
  7: 1,
  8: 1,
  9: 1,
  10: 7,
  11: 2,
  12: 31,
  13: 5,
  14: 1,
  15: 1,
  16: 4,
  17: 5,
  18: 1,
  19: 1,
  20: 6,
  21: 4,
  22: 2,
  23: 1,
  24: 1,
  25: 5,
  26: 5,
  27: 4,
  28: 3,
  29: 2,
  30: 3,
  31: 2,
  32: 2,
  33: 1,
  34: 2,
  35: 2,
  36: 221,
  37: 6}}

data = pd.DataFrame(dict_)

When I try to capitalize the City column (first letter uppercase, the rest lowercase), I get a strange character in the output.

data['City'].apply(str.capitalize)

The lowercase version of "İ" turns into a character that I cannot identify, for example:

import unicodedata
unicodedata.name("i̇")
# TypeError: name() argument 1 must be a unicode character, not str
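
The TypeError appears because the lowercased text is two code points, not one, so unicodedata.name() cannot take it as a single character. A minimal sketch that names each code point separately, assuming Python 3's built-in case mapping (the variable names are only illustrative):

```python
import unicodedata

# Lowercasing U+0130 with Python's default (locale-independent) rules
lowered = "İ".lower()

# Look up every code point on its own instead of passing the whole string
names = [unicodedata.name(ch) for ch in lowered]

print(len(lowered))
print(names)
```

The second code point is a combining mark, which is why the result looks like an ordinary "i" with something odd attached.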

I tried many solutions but to no avail!

Upvotes: 0

Answers (3)

Andj

Reputation: 1447

If you have ICU4C available, it's possible to use PyICU. To install PyICU:

pip install -U pyicu

There are two approaches:

1. Create a dataframe from the dictionary, then apply language-sensitive title casing.
2. Title-case the entries in the dictionary, then create the dataframe.

First option:

import icu
import pandas as pd

dict_ = {
    'City': {0: 'ADANA', 1: 'ANKARA', 2: 'ANTALYA', 3: 'AYDIN', 4: 'BALIKESİR', 5: 'BURSA', 6: 'DENİZLİ', 7: 'DÜZCE', 8: 'DİYARBAKIR', 9: 'ELAZIĞ', 10: 'GAZİANTEP', 11: 'GİRESUN', 12: 'HATAY', 13: 'KAHRAMANMARAŞ', 14: 'KARABÜK', 15: 'KARS', 16: 'KAYSERİ', 17: 'KIRIKKALE', 18: 'KIRKLARELİ', 19: 'KIRŞEHİR', 20: 'KOCAELİ', 21: 'KONYA', 22: 'KÜTAHYA', 23: 'MANİSA', 24: 'MARDİN', 25: 'MERSİN', 26: 'MUĞLA', 27: 'ORDU', 28: 'OSMANİYE', 29: 'SAKARYA', 30: 'SAMSUN', 31: 'TRABZON', 32: 'UŞAK', 33: 'YALOVA', 34: 'ZONGULDAK', 35: 'ÇORUM', 36: 'İSTANBUL', 37: 'İZMİR'},
    'Value': {0: 15, 1: 25, 2: 19, 3: 2, 4: 6, 5: 5, 6: 3, 7: 1, 8: 1, 9: 1, 10: 7, 11: 2, 12: 31, 13: 5, 14: 1, 15: 1, 16: 4, 17: 5, 18: 1, 19: 1, 20: 6, 21: 4, 22: 2, 23: 1, 24: 1, 25: 5, 26: 5, 27: 4, 28: 3, 29: 2, 30: 3, 31: 2, 32: 2, 33: 1, 34: 2, 35: 2, 36: 221, 37: 6}
}

data = pd.DataFrame(dict_)

# Create a Turkish locale instance:
locale = icu.Locale('tr')
# Create a Turkish collator instance:
collator = icu.Collator.createInstance(locale)

# create a function that performs Turkish title casing:
def turkish_title(city, loc=locale):
  return icu.CaseMap.toTitle(loc, city)

# Use function to update city names:
data['City'] = data['City'].apply(turkish_title)

# Sort dataframe using Turkish collation
data = data.sort_values("City", key = lambda x: x.map(collator.getSortKey))

data.head(15)

#           City  Value
# 0        Adana     15
# 1       Ankara     25
# 2      Antalya     19
# 3        Aydın      2
# 4    Balıkesir      6
# 5        Bursa      5
# 35       Çorum      2
# 6      Denizli      3
# 8   Diyarbakır      1
# 7        Düzce      1
# 9       Elazığ      1
# 10   Gaziantep      7
# 11     Giresun      2
# 12       Hatay     31
# 36    İstanbul    221

Second option:

Reusing code from above:

# Update the dictionary, using language sensitive title casing
dict_['City'].update({k: icu.CaseMap.toTitle(locale, v) for k, v in dict_['City'].items()})

# Create new dataframe
data2 = pd.DataFrame(dict_)

# Sort dataframe using Turkish collation:
data2.sort_values("City", key = lambda x: x.map(collator.getSortKey), inplace=True)

data2.head(15)

#           City  Value
# 0        Adana     15
# 1       Ankara     25
# 2      Antalya     19
# 3        Aydın      2
# 4    Balıkesir      6
# 5        Bursa      5
# 35       Çorum      2
# 6      Denizli      3
# 8   Diyarbakır      1
# 7        Düzce      1
# 9       Elazığ      1
# 10   Gaziantep      7
# 11     Giresun      2
# 12       Hatay     31
# 36    İstanbul    221

Upvotes: 1

Matt Pitkin

Reputation: 6417

Based on this solution, you could try the unicode_tr package, which can be installed with:

pip install unicode_tr

With this you can do:

import pandas as pd
from unicode_tr import unicode_tr

dict_ = {
    'City': {
        0: 'ADANA',
        1: 'ANKARA',
        2: 'ANTALYA',
        3: 'AYDIN',
        4: 'BALIKESİR',
        5: 'BURSA',
        6: 'DENİZLİ',
        7: 'DÜZCE',
        8: 'DİYARBAKIR',
        9: 'ELAZIĞ',
        10: 'GAZİANTEP',
        11: 'GİRESUN',
        12: 'HATAY',
        13: 'KAHRAMANMARAŞ',
        14: 'KARABÜK',
        15: 'KARS',
        16: 'KAYSERİ',
        17: 'KIRIKKALE',
        18: 'KIRKLARELİ',
        19: 'KIRŞEHİR',
        20: 'KOCAELİ',
        21: 'KONYA',
        22: 'KÜTAHYA',
        23: 'MANİSA',
        24: 'MARDİN',
        25: 'MERSİN',
        26: 'MUĞLA',
        27: 'ORDU',
        28: 'OSMANİYE',
        29: 'SAKARYA',
        30: 'SAMSUN',
        31: 'TRABZON',
        32: 'UŞAK',
        33: 'YALOVA',
        34: 'ZONGULDAK',
        35: 'ÇORUM',
        36: 'İSTANBUL',
        37: 'İZMİR'
    },
    'Value': {
        0: 15,
        1: 25,
        2: 19,
        3: 2,
        4: 6,
        5: 5,
        6: 3,
        7: 1,
        8: 1,
        9: 1,
        10: 7,
        11: 2,
        12: 31,
        13: 5,
        14: 1,
        15: 1,
        16: 4,
        17: 5,
        18: 1,
        19: 1,
        20: 6,
        21: 4,
        22: 2,
        23: 1,
        24: 1,
        25: 5,
        26: 5,
        27: 4,
        28: 3,
        29: 2,
        30: 3,
        31: 2,
        32: 2,
        33: 1,
        34: 2,
        35: 2,
        36: 221,
        37: 6
    }
}

data = pd.DataFrame(dict_)

data["City"].apply(unicode_tr.capitalize)

which outputs:

0             Adana
1            Ankara
2           Antalya
3             Aydın
4         Balıkesir
5             Bursa
6           Denizli
7             Düzce
8        Diyarbakır
9            Elazığ
10        Gaziantep
11          Giresun
12            Hatay
13    Kahramanmaraş
14          Karabük
15             Kars
16          Kayseri
17        Kırıkkale
18       Kırklareli
19         Kırşehir
20          Kocaeli
21            Konya
22          Kütahya
23           Manisa
24           Mardin
25           Mersin
26            Muğla
27             Ordu
28         Osmaniye
29          Sakarya
30           Samsun
31          Trabzon
32             Uşak
33           Yalova
34        Zonguldak
35            Çorum
36         İstanbul
37            İzmir
Name: City, dtype: object

Upvotes: 1

Faruk

Reputation: 1501

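def turkish_title_case(text):
    # Lowercase the Turkish capitals by hand so that "İ" becomes "i" and
    # "I" becomes "ı", instead of relying on Python's default mapping.
    turkish_correction = {"İ": "i", "I": "ı", "Ç": "ç", "Ğ": "ğ", "Ü": "ü", "Ş": "ş", "Ö": "ö"}

    for turkish, corrected in turkish_correction.items():
        text = text.replace(turkish, corrected)
    # Uppercase the first character and lowercase the rest.
    text = text.capitalize()

    # capitalize() turns a leading "i" into "I"; restore the dotted capital.
    turkish_correction = {"I": "İ"}
    for turkish, corrected in turkish_correction.items():
        text = text.replace(turkish, corrected)

    return text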

Considering that the city names are fixed, this may work for this case.
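
One caveat: a name that starts with dotless "I" (IĞDIR, for example) would come out with "İ", because capitalize() turns the leading "ı" into "I" and the last step then replaces it. A possible variant, sketched here with str.translate and limited to the two I pairs that differ in Turkish, keeps such names intact. `tr_capitalize` and the two tables are hypothetical names.

```python
# Only the two I pairs differ from Python's default case mapping
_TO_LOWER = str.maketrans({"I": "ı", "İ": "i"})
_TO_UPPER = str.maketrans({"ı": "I", "i": "İ"})

def tr_capitalize(text):
    if not text:
        return text
    # Fix the Turkish I pairs first, then let Python handle the rest
    head = text[0].translate(_TO_UPPER).upper()
    tail = text[1:].translate(_TO_LOWER).lower()
    return head + tail

print(tr_capitalize("İSTANBUL"))
print(tr_capitalize("IĞDIR"))
print(tr_capitalize("DİYARBAKIR"))
```

It can be applied to the column in the same way, with data["City"].apply(tr_capitalize).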

Upvotes: 1
