Reputation: 963
I have a dataframe consisting of two columns: (1) Turkish cities, (2) corresponding values.
import pandas as pd

dict_ = {'City': {0: 'ADANA',
1: 'ANKARA',
2: 'ANTALYA',
3: 'AYDIN',
4: 'BALIKESİR',
5: 'BURSA',
6: 'DENİZLİ',
7: 'DÜZCE',
8: 'DİYARBAKIR',
9: 'ELAZIĞ',
10: 'GAZİANTEP',
11: 'GİRESUN',
12: 'HATAY',
13: 'KAHRAMANMARAŞ',
14: 'KARABÜK',
15: 'KARS',
16: 'KAYSERİ',
17: 'KIRIKKALE',
18: 'KIRKLARELİ',
19: 'KIRŞEHİR',
20: 'KOCAELİ',
21: 'KONYA',
22: 'KÜTAHYA',
23: 'MANİSA',
24: 'MARDİN',
25: 'MERSİN',
26: 'MUĞLA',
27: 'ORDU',
28: 'OSMANİYE',
29: 'SAKARYA',
30: 'SAMSUN',
31: 'TRABZON',
32: 'UŞAK',
33: 'YALOVA',
34: 'ZONGULDAK',
35: 'ÇORUM',
36: 'İSTANBUL',
37: 'İZMİR'},
'Value': {0: 15,
1: 25,
2: 19,
3: 2,
4: 6,
5: 5,
6: 3,
7: 1,
8: 1,
9: 1,
10: 7,
11: 2,
12: 31,
13: 5,
14: 1,
15: 1,
16: 4,
17: 5,
18: 1,
19: 1,
20: 6,
21: 4,
22: 2,
23: 1,
24: 1,
25: 5,
26: 5,
27: 4,
28: 3,
29: 2,
30: 3,
31: 2,
32: 2,
33: 1,
34: 2,
35: 2,
36: 221,
37: 6}}
data = pd.DataFrame(dict_)
When I try to capitalize the City column (first letter uppercase, the rest lowercase), I run into a weird character issue.
data['City'].apply(str.capitalize)
The lowercase version of "İ" turns into a character that I cannot identify. Trying to look it up does not help either:
import unicodedata
unicodedata.name("i̇")
# TypeError: name() argument 1 must be a unicode character, not str
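The TypeError comes from passing more than one code point to unicodedata.name(). A minimal sketch, assuming only the standard library (the variable names are illustrative), that names each code point of the lowercased "İ" separately:

```python
import unicodedata

# "\u0130" is LATIN CAPITAL LETTER I WITH DOT ABOVE ("İ").
lowered = "\u0130".lower()  # str.lower() is not locale-aware

# unicodedata.name() accepts exactly one character, so name them one by one.
names = [unicodedata.name(ch) for ch in lowered]

print(len(lowered), names)
```

This should show that str.lower() turns "İ" into a plain "i" followed by a combining dot above (U+0307), which is why the result looks like an unfamiliar character.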
I tried many solutions but to no avail!
Upvotes: 0
Views: 183
Reputation: 1447
If you have ICU4C available, it's possible to use PyICU. To install PyICU:
pip install -U pyicu
There are two approaches:
First option:
import icu
import pandas as pd

dict_ = {
'City': {0: 'ADANA', 1: 'ANKARA', 2: 'ANTALYA', 3: 'AYDIN', 4: 'BALIKESİR', 5: 'BURSA', 6: 'DENİZLİ', 7: 'DÜZCE', 8: 'DİYARBAKIR', 9: 'ELAZIĞ', 10: 'GAZİANTEP', 11: 'GİRESUN', 12: 'HATAY', 13: 'KAHRAMANMARAŞ', 14: 'KARABÜK', 15: 'KARS', 16: 'KAYSERİ', 17: 'KIRIKKALE', 18: 'KIRKLARELİ', 19: 'KIRŞEHİR', 20: 'KOCAELİ', 21: 'KONYA', 22: 'KÜTAHYA', 23: 'MANİSA', 24: 'MARDİN', 25: 'MERSİN', 26: 'MUĞLA', 27: 'ORDU', 28: 'OSMANİYE', 29: 'SAKARYA', 30: 'SAMSUN', 31: 'TRABZON', 32: 'UŞAK', 33: 'YALOVA', 34: 'ZONGULDAK', 35: 'ÇORUM', 36: 'İSTANBUL', 37: 'İZMİR'},
'Value': {0: 15, 1: 25, 2: 19, 3: 2, 4: 6, 5: 5, 6: 3, 7: 1, 8: 1, 9: 1, 10: 7, 11: 2, 12: 31, 13: 5, 14: 1, 15: 1, 16: 4, 17: 5, 18: 1, 19: 1, 20: 6, 21: 4, 22: 2, 23: 1, 24: 1, 25: 5, 26: 5, 27: 4, 28: 3, 29: 2, 30: 3, 31: 2, 32: 2, 33: 1, 34: 2, 35: 2, 36: 221, 37: 6}
}
data = pd.DataFrame(dict_)
# Create a Turkish locale instance:
locale = icu.Locale('tr')
# Create a Turkish collator instance:
collator = icu.Collator.createInstance(locale)
# Create a function that performs Turkish title casing:
def turkish_title(city, loc=locale):
    return icu.CaseMap.toTitle(loc, city)
# Use function to update city names:
data['City'] = data['City'].apply(turkish_title)
# Sort dataframe using Turkish collation (sort_values returns a new frame, so assign it back):
data = data.sort_values("City", key=lambda x: x.map(collator.getSortKey))
data.head(15)
# City Value
# 0 Adana 15
# 1 Ankara 25
# 2 Antalya 19
# 3 Aydın 2
# 4 Balıkesir 6
# 5 Bursa 5
# 35 Çorum 2
# 6 Denizli 3
# 8 Diyarbakır 1
# 7 Düzce 1
# 9 Elazığ 1
# 10 Gaziantep 7
# 11 Giresun 2
# 12 Hatay 31
# 36 İstanbul 221
Second option:
Reusing code from above:
# Update the dictionary, using language sensitive title casing
dict_['City'].update({k: icu.CaseMap.toTitle(locale, v) for k, v in dict_['City'].items()})
# Create new dataframe
data2 = pd.DataFrame(dict_)
# Sort dataframe using Turkish collation:
data2.sort_values("City", key = lambda x: x.map(collator.getSortKey), inplace=True)
data2.head(15)
# City Value
# 0 Adana 15
# 1 Ankara 25
# 2 Antalya 19
# 3 Aydın 2
# 4 Balıkesir 6
# 5 Bursa 5
# 35 Çorum 2
# 6 Denizli 3
# 8 Diyarbakır 1
# 7 Düzce 1
# 9 Elazığ 1
# 10 Gaziantep 7
# 11 Giresun 2
# 12 Hatay 31
# 36 İstanbul 221
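If installing ICU is not an option, the ordering can be roughly approximated in plain Python. This is a sketch of a different, much simpler technique than ICU collation: it ranks letters by their position in the 29-letter Turkish alphabet and ignores everything else a real collator handles. The names TURKISH_ALPHABET and turkish_sort_key are hypothetical helpers, not part of any library.

```python
# Turkish alphabet in dictionary order (lowercase).
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
_RANK = {ch: i for i, ch in enumerate(TURKISH_ALPHABET)}

# Lowercase "İ" and "I" the Turkish way before str.lower() sees them.
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})


def turkish_sort_key(word):
    """Return a list of alphabet positions; unknown characters sort last."""
    lowered = word.translate(_TR_LOWER).lower()
    return [_RANK.get(ch, len(_RANK)) for ch in lowered]


cities = ["İstanbul", "Hatay", "Çorum", "Denizli", "Bursa", "Düzce"]
ordered = sorted(cities, key=turkish_sort_key)
print(ordered)
```

For these six names the result should agree with the ICU output above: Çorum directly after Bursa, and İstanbul after Hatay. With pandas, the same key could be passed as key=lambda s: s.map(turkish_sort_key), though this has only been reasoned through for plain lists here.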
Upvotes: 1
Reputation: 6417
Based on this solution, you could try the unicode_tr package, which can be installed with:
pip install unicode_tr
With this you can do:
import pandas as pd
from unicode_tr import unicode_tr
dict_ = {
'City': {
0: 'ADANA',
1: 'ANKARA',
2: 'ANTALYA',
3: 'AYDIN',
4: 'BALIKESİR',
5: 'BURSA',
6: 'DENİZLİ',
7: 'DÜZCE',
8: 'DİYARBAKIR',
9: 'ELAZIĞ',
10: 'GAZİANTEP',
11: 'GİRESUN',
12: 'HATAY',
13: 'KAHRAMANMARAŞ',
14: 'KARABÜK',
15: 'KARS',
16: 'KAYSERİ',
17: 'KIRIKKALE',
18: 'KIRKLARELİ',
19: 'KIRŞEHİR',
20: 'KOCAELİ',
21: 'KONYA',
22: 'KÜTAHYA',
23: 'MANİSA',
24: 'MARDİN',
25: 'MERSİN',
26: 'MUĞLA',
27: 'ORDU',
28: 'OSMANİYE',
29: 'SAKARYA',
30: 'SAMSUN',
31: 'TRABZON',
32: 'UŞAK',
33: 'YALOVA',
34: 'ZONGULDAK',
35: 'ÇORUM',
36: 'İSTANBUL',
37: 'İZMİR'
},
'Value': {
0: 15,
1: 25,
2: 19,
3: 2,
4: 6,
5: 5,
6: 3,
7: 1,
8: 1,
9: 1,
10: 7,
11: 2,
12: 31,
13: 5,
14: 1,
15: 1,
16: 4,
17: 5,
18: 1,
19: 1,
20: 6,
21: 4,
22: 2,
23: 1,
24: 1,
25: 5,
26: 5,
27: 4,
28: 3,
29: 2,
30: 3,
31: 2,
32: 2,
33: 1,
34: 2,
35: 2,
36: 221,
37: 6
}
}
data = pd.DataFrame(dict_)
data["City"].apply(unicode_tr.capitalize)
which outputs:
0 Adana
1 Ankara
2 Antalya
3 Aydın
4 Balıkesir
5 Bursa
6 Denizli
7 Düzce
8 Diyarbakır
9 Elazığ
10 Gaziantep
11 Giresun
12 Hatay
13 Kahramanmaraş
14 Karabük
15 Kars
16 Kayseri
17 Kırıkkale
18 Kırklareli
19 Kırşehir
20 Kocaeli
21 Konya
22 Kütahya
23 Manisa
24 Mardin
25 Mersin
26 Muğla
27 Ordu
28 Osmaniye
29 Sakarya
30 Samsun
31 Trabzon
32 Uşak
33 Yalova
34 Zonguldak
35 Çorum
36 İstanbul
37 İzmir
Name: City, dtype: object
Upvotes: 1
Reputation: 1501
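def turkish_title_case(text):
    # Lowercase the Turkish letters by hand first, so that str.capitalize()
    # never has to lowercase "İ" itself.
    turkish_correction = {"İ": "i", "I": "ı", "Ç": "ç", "Ğ": "ğ", "Ü": "ü", "Ş": "ş", "Ö": "ö"}
    for turkish, corrected in turkish_correction.items():
        text = text.replace(turkish, corrected)
    text = text.capitalize()
    # capitalize() turns a leading "i" into "I"; restore the dotted capital.
    # Note: a name starting with dotless "I" would also become "İ" here.
    turkish_correction = {"I": "İ"}
    for turkish, corrected in turkish_correction.items():
        text = text.replace(turkish, corrected)
    return text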
Considering that the city names are fixed, this may work for this case.
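A slightly more general variant of the same idea, as a sketch assuming only the standard library: only the dotted/dotless i pair needs special handling, since Python's default case mappings already cover Ç, Ğ, Ö, Ş and Ü. The name tr_capitalize and the two translation tables are hypothetical, introduced here for illustration.

```python
# Turkish-specific mappings for the i pair; everything else uses str defaults.
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})  # İ -> i, I -> ı
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> İ, ı -> I


def tr_capitalize(text):
    """Uppercase the first letter and lowercase the rest, Turkish-style."""
    if not text:
        return text
    head = text[0].translate(_TR_UPPER).upper()
    tail = text[1:].translate(_TR_LOWER).lower()
    return head + tail


for name in ["\u0130STANBUL", "D\u0130YARBAKIR", "KIRIKKALE", "ISPARTA"]:
    print(tr_capitalize(name))
```

It can be applied to the column with data['City'].apply(tr_capitalize). Unlike the replace-based version, a name starting with dotless "I" keeps its dotless capital.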
Upvotes: 1