Reputation: 514

How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?

I am trying to build a string to use with the Google Geocoding API. I've checked a lot of threads, but I am still facing a problem and I don't understand how to solve it.

I need addresse1 to be a string without any special characters. Addresse1 is for example: "32 rue d'Athènes Paris France".

addresse1= collect.replace(' ','+').replace('\n','') 
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')

Here I got a string without any accent... Oh no... it is not a string but a bytes object. So I did what was suggested and decoded it:

addresse1=addresse1.decode('utf-8')

But then addresse1 is exactly the same as at the beginning... What do I have to do? What am I doing wrong? What am I not understanding about Unicode? Or is there a better solution?

Thanks,

Stéphane.

Upvotes: 14

Answers (5)

VectorVictor

Reputation: 820

I had a similar problem where I was generating tags that users might have to type with their phone.

Without using 3rd party packages you can simplify bobince's answer above:

import unicodedata

collect = "32 rue d'Athènes Paris France"
unicode_collect = unicodedata.normalize('NFD', collect)
address1 = unicode_collect.encode('ascii', 'ignore').decode('utf-8')

address1:
"32 rue d'Athenes Paris France"

Upvotes: 2

J. Doe

Reputation: 81

Generally, there are two approaches: (1) regular expressions and (2) str.translate.

1) regular expressions

Decompose the string and remove the characters in the Unicode block \u0300-\u036f (Combining Diacritical Marks):

import unicodedata
import re
word = unicodedata.normalize("NFD", word)
word = re.sub("[\u0300-\u036f]", "", word)

It removes accents, circumflexes, diaereses, and so on:

pingüino > pinguino
εἴκοσι εἶσι > εικοσι εισι

For some languages, it could be another block, such as [\u0559-\u055f] for Armenian script.

2) str.translate

First, create a replacement table (case-sensitive) and then apply it.

repl = str.maketrans(
    "áéúíó",
    "aeuio"
)
word.translate(repl)

Multi-character replacements are made as follows:

repl = {
    ord("æ"): "ae",
    ord("œ"): "oe",
}
word.translate(repl)
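
The two approaches can also be combined. The following is only a sketch: the function name remove_accents and the LIGATURES table are made up for illustration, and the table is far from complete. Letters such as æ and œ have no Unicode decomposition, so they are mapped explicitly first, and the remaining accents are then stripped after NFD normalisation:

```python
import re
import unicodedata

# Hypothetical, incomplete table: letters with no canonical decomposition
# need an explicit replacement.
LIGATURES = {ord("æ"): "ae", ord("œ"): "oe", ord("Æ"): "AE", ord("Œ"): "OE"}

def remove_accents(word):
    # Step 1: replace letters that NFD cannot split into base + mark.
    word = word.translate(LIGATURES)
    # Step 2: split accented letters into base letter + combining mark.
    word = unicodedata.normalize("NFD", word)
    # Step 3: drop the combining diacritical marks.
    return re.sub("[\u0300-\u036f]", "", word)

print(remove_accents("pingüino"))
print(remove_accents("cœur d'Athènes"))
```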

Upvotes: 3

Ignacio Vazquez-Abrams

Reputation: 799240

With a third-party package, unidecode:

3>> import unidecode
3>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"

Upvotes: 28

bobince

Reputation: 536675

addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')

You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.
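
A quick way to see the difference, as a minimal sketch using the address from the question (the variable names are only illustrative):

```python
import unicodedata

address = "32 rue d'Athènes Paris France"
decomposed = unicodedata.normalize('NFKD', address)

# UTF-8 can represent every character, so 'ignore' has nothing to drop
# and the round trip gives back the decomposed string unchanged.
utf8_round_trip = decomposed.encode('utf-8', 'ignore').decode('utf-8')

# ASCII cannot represent the combining grave accent (U+0300), so 'ignore'
# silently drops it and only the base letter "e" is left.
ascii_round_trip = decomposed.encode('ascii', 'ignore').decode('ascii')

print(utf8_round_trip == decomposed)
print(ascii_round_trip)
```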

is there a better solution?

It depends what you are trying to do.

If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M.
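
A minimal sketch of that idea, assuming a helper name (strip_marks) chosen here for illustration. Note that NFKD also applies compatibility mappings, and that in some scripts (Devanagari or Thai, for instance) category M characters are vowel signs rather than optional accents, so this is not safe for every language:

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (categories Mn, Mc, Me), keep everything else."""
    decomposed = unicodedata.normalize('NFKD', text)
    kept = [ch for ch in decomposed
            if not unicodedata.category(ch).startswith('M')]
    # Recompose whatever is left so the result is in a compact form.
    return unicodedata.normalize('NFC', ''.join(kept))

print(strip_marks("32 rue d'Athènes Paris France"))
print(strip_marks("Ångström 東京"))
```

Unlike an ASCII encode with 'ignore', this keeps the non-ASCII characters that carry no diacritics, such as the Japanese text in the second call.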

If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
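
As a rough sketch of what "custom replacements" could look like, here are two small per-language tables. Both tables and the transliterate helper are hypothetical and deliberately incomplete (lowercase only); the right mapping depends on the convention you want to follow:

```python
import unicodedata

# Illustrative, incomplete tables; not an authoritative transliteration scheme.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
SWEDISH = str.maketrans({"å": "a", "ä": "a", "ö": "o"})

def transliterate(text, table):
    # Compose first so that "o" + combining diaeresis matches the "ö" key.
    return unicodedata.normalize("NFC", text).translate(table)

print(transliterate("Köln", GERMAN))
print(transliterate("Malmö", SWEDISH))
```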

If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.

The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.

ETA:

url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...

Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode; the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:

params = dict(
    address=address1.encode('utf-8'),
    key=googlekey
)
url2 = '...?' + urllib.urlencode(params)

(In Python 3 it's urllib.parse.quote and urllib.parse.urlencode, and they automatically choose UTF-8, so you don't have to manually encode there.)
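
For Python 3, a minimal sketch might look like the following. The https:// scheme and the placeholder key are assumptions added for illustration, and no request is actually sent:

```python
from urllib.parse import quote, urlencode

address = "32 rue d'Athènes Paris France"
google_key = "YOUR_API_KEY"  # placeholder, not a real key

# urlencode percent-encodes each value as UTF-8 and joins the pairs with "&".
# A list of pairs keeps the parameter order fixed.
query = urlencode([('address', address), ('key', google_key)])
url = 'https://maps.googleapis.com/maps/api/geocode/json?' + query
print(url)

# quote does the same for a single string; spaces become %20 instead of "+".
print(quote(address))
```

The accented address goes in as it is: there is no need to strip the accent or to replace spaces with "+" by hand.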

data2 = urllib.request.urlopen(url2).read().decode('utf-8')
data3=json.loads(data2)

From Python 3.6 onwards, json.loads accepts byte strings, so you can omit the UTF-8 decode there (on 3.5 you still need it). In the same way, json.load will read directly from a file-like object, so you shouldn't have to load the data into a string at all:

data3 = json.load(urllib.request.urlopen(url2))
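
Whether json.load accepts a binary stream depends on the Python version (bytes input was added to the json module in 3.6). One version-independent option is to wrap the stream in io.TextIOWrapper. In this sketch io.BytesIO stands in for the object returned by urlopen, and the payload is invented for illustration:

```python
import io
import json

# Stand-in for the HTTP response: a binary stream holding UTF-8 JSON.
# The payload is made up for this example.
payload = '{"status": "OK", "results": [{"formatted_address": "32 Rue d\'Athènes"}]}'
fake_response = io.BytesIO(payload.encode('utf-8'))

# The wrapper decodes the bytes to text while json.load reads from it.
with io.TextIOWrapper(fake_response, encoding='utf-8') as text_stream:
    data3 = json.load(text_stream)

print(data3['status'])
```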

Upvotes: 4

G4A

Reputation: 195

You can use the translate() method of Python strings. Here's an example adapted for Python 3 from one on tutorialspoint.com (in Python 3, maketrans is a static method of str rather than a function in the string module):

intab = "aeiou"
outtab = "12345"
trantab = str.maketrans(intab, outtab)   # Build the character-to-character table.

text = "this is string example....wow!!!"
print(text.translate(trantab))

This outputs:

th3s 3s str3ng 2x1mpl2....w4w!!!

So you can define which characters you wish to replace more easily than with replace().

Upvotes: 0
