Reputation: 1263
I got some Wikipedia URLs from a Freebase dump:
url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa
url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa
Both refer to the same page on Wikipedia:
url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa
urllib.unquote works on url 1 when applied twice:
import urllib

url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)  # first pass: %25 -> %, leaving %C3%A3
url = urllib.unquote(url)  # second pass: %C3%A3 -> UTF-8 bytes for "ã"
print url
The result is:
Pedro_Miguel_de_Castro_Brandão_Costa
But it does not work on url 2:
url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url
The result is:
Pedro_Miguel_de_Castro_Brand�o_Costa
Is something wrong?
Upvotes: 3
Views: 2964
Reputation: 799250
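The former is double percent-encoded UTF-8: unquoting it twice yields UTF-8 bytes, which print normally since your terminal uses UTF-8. The latter is percent-encoded Latin-1: unquoting it yields the single byte \xe3, which is not valid UTF-8, so it has to be decoded as Latin-1 first.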
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand�o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa
Upvotes: 4