icycandy

Reputation: 1263

python url decode %E3

I got some Wikipedia URLs from a Freebase dump:

url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa

url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa

They both refer to the same page on Wikipedia:

url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa

urllib.unquote works on url 1 (applied twice):

import urllib

url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)  # first pass: %25C3%25A3 -> %C3%A3
url = urllib.unquote(url)  # second pass: %C3%A3 -> the UTF-8 bytes for 'ã'
print url

result is

Pedro_Miguel_de_Castro_Brandão_Costa

but it does not work on url 2:

url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)  # %E3 -> the single byte '\xe3'
print url

result is

Pedro_Miguel_de_Castro_Brand�o_Costa    

Is there something wrong?

Upvotes: 3

Views: 2964

Answers (1)

Ignacio Vazquez-Abrams

Reputation: 799250

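The former is double percent-encoded UTF-8, which prints normally after two rounds of unquoting because your terminal uses UTF-8. The latter is percent-encoded Latin-1: unquoting it yields the single byte \xe3, which is not valid UTF-8 on its own, so it has to be decoded from Latin-1 first.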

>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand�o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa
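
If the dump mixes both styles, one possible way to handle them together is to unquote until the string stops changing, then try UTF-8 and fall back to Latin-1. This is only a rough sketch: `decode_title` and `max_rounds` are hypothetical names, not part of urllib, and the import is written so that it should work on both Python 2 and Python 3.

```python
try:
    from urllib.parse import unquote_to_bytes  # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes  # Python 2: returns a byte string


def decode_title(quoted, max_rounds=3):
    """Hypothetical helper: percent-decode, then guess UTF-8 before Latin-1."""
    # Work on bytes so the result of unquoting can be decoded explicitly.
    raw = quoted if isinstance(quoted, bytes) else quoted.encode('ascii')
    # Unquote repeatedly to undo double quoting such as %25C3%25A3.
    for _ in range(max_rounds):
        unquoted = unquote_to_bytes(raw)
        if unquoted == raw:
            break
        raw = unquoted
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1.
        return raw.decode('latin-1')


title_1 = decode_title('Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa')
title_2 = decode_title('Pedro_Miguel_de_Castro_Brand%E3o_Costa')
```

Both calls should give the same unicode title. Note that this is a heuristic: repeated unquoting would over-decode a title that really contains a literal `%25`, and a Latin-1 byte sequence that happens to be valid UTF-8 would be decoded as UTF-8.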

Upvotes: 4
