Reputation: 351
I'm using the unidecode library to convert accented strings to their ASCII representation.
>>> accented_string = u'Málaga'
# accented_string is of type 'unicode'
>>> import unidecode
>>> unidecode.unidecode(accented_string)
'Malaga'
The problem is that I'm reading the strings from a file. How do I pass them to the unidecode library?
for name in strings:
    print unidecode.unidecode(u+name) #?????
I can't get my head around it. If I encode the string, I just get the wrong encoding.
Upvotes: 0
Views: 10639
Reputation: 50190
We still don't know the type of your pandas column, so here are two versions for Python 2:
If strings is already a sequence of Unicode strings (type(name) is unicode):
for name in strings:
    print unidecode.unidecode(name)
If the elements of strings are regular Python 2 str (type(name) is str):
for name in strings:
    print unidecode.unidecode(name.decode("utf-8"))
This will work if your strings are stored in the UTF-8 encoding. Otherwise you'll have to supply the appropriate encoding, e.g. "latin-1".
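To see why the encoding argument matters, here is a minimal sketch written for Python 3 (where the raw data is a bytes object rather than a Python 2 str). The variable names are only illustrative; the byte strings stand in for whatever was read from the file.

```python
# The same text, as it would be stored on disk in two different encodings
raw_utf8 = "Málaga".encode("utf-8")      # b'M\xc3\xa1laga'
raw_latin1 = "Málaga".encode("latin-1")  # b'M\xe1laga'

# Decoding with the matching encoding recovers the original text
decoded_ok = raw_utf8.decode("utf-8")

# Decoding UTF-8 bytes as latin-1 does not fail, it silently yields mojibake
decoded_wrong = raw_utf8.decode("latin-1")

# Decoding latin-1 bytes as UTF-8 raises an error instead
try:
    raw_latin1.decode("utf-8")
    latin1_as_utf8_failed = False
except UnicodeDecodeError:
    latin1_as_utf8_failed = True

print(decoded_ok)             # Málaga
print(decoded_wrong)          # MÃ¡laga
print(latin1_as_utf8_failed)  # True
```

If the output of your own data looks like the second line, the file is probably UTF-8 being decoded with a single-byte encoding.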
In Python 3, the first version should work once print is called as a function, i.e. print(unidecode.unidecode(name)). You'll have to sort out your encoding issues before you get to this point, i.e. when you first read in your data from disk.
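As a rough sketch of "sorting out the encoding when you read the data" in Python 3: pass the encoding to open() so every line arrives already decoded. The helper name read_names and the temporary sample file are assumptions made for illustration, and UTF-8 is assumed as the file's encoding.

```python
import os
import tempfile

def read_names(path, encoding="utf-8"):
    """Return the non-empty lines of a text file as decoded str objects."""
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]

# Build a small UTF-8 file to stand in for the real data
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("Málaga\nCórdoba\n")
    sample_path = tmp.name

names = read_names(sample_path)
os.remove(sample_path)

print(names)  # ['Málaga', 'Córdoba']
```

Each element of names is a Python 3 str, so it can be handed to unidecode.unidecode(name) directly, with no u prefix and no .decode() call.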
Upvotes: 1
Reputation: 351
I found a workaround that turned out to be simple: decode the string read from the file into a unicode string, then pass it to the unidecode library.
>>> accented_string = 'Málaga'
>>> accented_string_u = accented_string.decode('utf-8')
>>> import unidecode
>>> unidecode.unidecode(accented_string_u)
'Malaga'
Upvotes: 1
Reputation: 5537
Use unicodedata.normalize:
import unicodedata
accented_string = u"Málaga"
unicodedata.normalize("NFKD", accented_string).encode("ascii", "ignore")
There are 4 normalized forms that you can use: "NFC", "NFKC", "NFD", and "NFKD".
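Wrapped in a function, the approach above might look like the following Python 3 sketch. The name strip_accents is hypothetical, and the final .decode("ascii") is added only so the result is a str rather than bytes.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("Málaga"))  # Malaga
print(strip_accents("Søren"))   # Sren
```

Note the second result: a character with no decomposition, such as "ø", is dropped entirely rather than transliterated, which is one reason to prefer unidecode when a readable ASCII approximation is needed.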
Here are the details, as given in the documentation linked above:
The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.
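The four forms can be checked against the two characters used in the quoted text. This is a small Python 3 sketch using only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

c_cedilla = "\u00c7"   # LATIN CAPITAL LETTER C WITH CEDILLA
roman_one = "\u2160"   # ROMAN NUMERAL ONE

# Canonical forms: NFD splits into base letter + combining mark, NFC recombines
nfd = unicodedata.normalize("NFD", c_cedilla)
nfc = unicodedata.normalize("NFC", nfd)
print(len(nfd), len(nfc))  # 2 1

# Compatibility forms replace compatibility characters with their equivalents
nfkd_roman = unicodedata.normalize("NFKD", roman_one)
nfkc_roman = unicodedata.normalize("NFKC", roman_one)
print(nfkd_roman, nfkc_roman)  # I I

# The canonical forms leave the compatibility character untouched
nfc_roman = unicodedata.normalize("NFC", roman_one)
print(nfc_roman == roman_one)  # True
```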
Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
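The comparison pitfall is easy to reproduce. In this Python 3 sketch, the helper same_text is a hypothetical name; it simply normalizes both sides to the same form (NFC is an arbitrary choice here) before comparing.

```python
import unicodedata

composed = "\u00c7"      # single code point
decomposed = "C\u0327"   # base letter followed by a combining cedilla

# Both render as the same glyph, yet they are different code point sequences
print(composed == decomposed)  # False

def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(same_text(composed, decomposed))  # True
```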
Upvotes: 0