Reputation: 145
I would like to know if there is any way to find an ASCII character equivalent to a non-ASCII (UTF-8) character.
I've done several tests with the unidecode library, but it doesn't completely satisfy what I need.
For example, consider these characters:
import unidecode
x = 'ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶'
unidecode.unidecode(x)
Output = "i, V, G', T, f, "
I would like something like: "i, v, f, t, o, 1"
There must be a way, right? Thanks in advance for any help!
Upvotes: 1
Views: 892
Reputation: 6018
As others have mentioned, there isn't necessarily a relationship between two characters just because they look alike. Most similar examples online focus on removing accents and use the standard-library module unicodedata
, which applies the standard Unicode normalization forms such as NFKD (NFKD explained here):
import unicodedata
str_unicode = u"ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶"
# 'replace': any character that can't be encoded is replaced with '?'
print(unicodedata.normalize('NFKD', str_unicode).encode("ascii", 'replace'))
# 'ignore': any character that can't be encoded is silently dropped
print(unicodedata.normalize('NFKD', str_unicode).encode("ascii", 'ignore'))
b'i, ?, ?, ?, ?, ?'
b'i, , , , , '
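You can see why NFKD only handles the "ⁱ" by inspecting each character's compatibility decomposition: only the superscript i actually decomposes to a plain ASCII letter, while the rest are standalone code points that merely look alike.

```python
import unicodedata

# Print the Unicode name and decomposition of each problem character.
# An empty decomposition means NFKD has nothing to normalize it to.
for ch in "ⁱᴠҒƬѳ❶":
    decomp = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomp}")
```

Only U+2071 (SUPERSCRIPT LATIN SMALL LETTER I) reports a decomposition, `<super> 0069`, i.e. a plain "i".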
The unidecode
library seems closer for your specific example, but I think you will have to augment it with a translate
call to clean up the characters the library doesn't map.
In the script below I also added a character that unidecode can't map, the paragraph mark "¶", and mapped it to "P" for reference:
import unicodedata
import unidecode
# Manually map the characters unidecode can't handle before transliterating
str_unicode = u"ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶, ¶"
dict_mapping = str.maketrans("❶¶", "1P")
str_unidecode = unidecode.unidecode(str_unicode)
str_unidecode_translated = unidecode.unidecode(str_unicode.translate(dict_mapping))
print(str_unidecode)
print(str_unidecode_translated)
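The same pattern also works with the standard library alone, in case pulling in unidecode isn't an option: apply a hand-maintained translation table first, then let NFKD handle anything that has a compatibility decomposition. The helper name `to_ascii` and the entries in `MANUAL_MAP` are my own choices for illustration; extend the table as you encounter new look-alikes.

```python
import unicodedata

# Hand-maintained fixes for look-alikes that have no Unicode decomposition,
# using the lowercase targets from the question plus the answer's "¶" -> "P".
MANUAL_MAP = str.maketrans({"ᴠ": "v", "Ғ": "f", "Ƭ": "t", "ѳ": "o", "❶": "1", "¶": "P"})

def to_ascii(text: str) -> str:
    # Manual substitutions first, then NFKD for anything decomposable;
    # whatever still can't be encoded is silently dropped.
    normalized = unicodedata.normalize("NFKD", text.translate(MANUAL_MAP))
    return normalized.encode("ascii", "ignore").decode("ascii")

print(to_ascii("ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶"))  # i, v, f, t, o, 1
```

This reproduces the exact output asked for in the question, at the cost of having to maintain the mapping table yourself.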
Upvotes: 1