user20273

Reputation: 145

Replace non-ASCII characters with ASCII equivalents

I would like to know if there is any way to find an ASCII character equivalent to a given non-ASCII character.

I've run several tests with the unidecode library, but it doesn't fully do what I need.

For example, consider these characters:

import unidecode

x = 'ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶'
unidecode.unidecode(x)

Output = "i, V, G', T, f, "

I would like something like: "i, v, f, t, o, 1"

There must be a way! Thanks in advance for any help.

Upvotes: 1

Views: 892

Answers (1)

Stephan

Reputation: 6018

As others have mentioned, there isn't necessarily a mapping between two characters just because they look alike. Most similar examples online focus on removing accents and use the standard-library module unicodedata, which applies standard normalization forms such as NFKD when converting toward ASCII (NFKD explained here).

More Common unicodedata Approach

import unicodedata

str_unicode = u"ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶"

# 'replace': any character that can't be encoded is replaced with '?'
print(unicodedata.normalize('NFKD', str_unicode).encode("ascii", 'replace'))
# 'ignore': any character that can't be encoded is dropped
print(unicodedata.normalize('NFKD', str_unicode).encode("ascii", 'ignore'))

unicodedata Output

b'i, ?, ?, ?, ?, ?'
b'i, , , , , '

unidecode Map with Translate

The unidecode library seems closer for your specific example. I think you will have to augment it, though, with a translate call to clean up the characters the library doesn't map.

I added a second example with a character that couldn't be mapped: the paragraph mark "¶", mapped to "P" for reference.

import unidecode

str_unicode = u"ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶, ¶"

# map the characters unidecode can't handle to chosen ASCII replacements
dict_mapping = str.maketrans("❶¶", "1P")

str_unidecode = unidecode.unidecode(str_unicode)
str_unidecode_translated = unidecode.unidecode(str_unicode.translate(dict_mapping))

print(str_unidecode)
print(str_unidecode_translated)
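If you'd rather avoid the third-party dependency entirely, the same idea works with only the standard library: run your hand-built confusables map first, then let NFKD handle whatever does decompose (like the superscript "ⁱ"). This is a minimal sketch; the `CONFUSABLES` table is built by hand for exactly your example characters, and you would have to extend it for anything else.

```python
import unicodedata

# Hand-built lookalike map for characters that have no NFKD
# decomposition to ASCII (all chosen for this example only).
CONFUSABLES = str.maketrans({
    'ᴠ': 'v',   # LATIN LETTER SMALL CAPITAL V
    'Ғ': 'f',   # CYRILLIC CAPITAL LETTER GHE WITH STROKE
    'Ƭ': 't',   # LATIN CAPITAL LETTER T WITH HOOK
    'ѳ': 'o',   # CYRILLIC SMALL LETTER FITA
    '❶': '1',   # DINGBAT NEGATIVE CIRCLED DIGIT ONE
})

def to_ascii(text):
    # custom map first, then NFKD for characters that do decompose
    text = text.translate(CONFUSABLES)
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('ⁱ, ᴠ, Ғ, Ƭ, ѳ, ❶'))
# → i, v, f, t, o, 1
```

The downside is the same as with the translate approach above: every lookalike you care about has to be listed explicitly, because Unicode itself doesn't record "visually similar" as a property.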

Upvotes: 1
