Reputation: 11832

Decoding UTF-8 URL in Python

I have a string like "pe%20to%C5%A3i%20mai". When I apply urllib.parse.unquote to it, I get "pe to\u0163i mai". If I write this to a file, I get those exact characters (the literal \u0163 escape), not the expected glyph.

How can I convert the string to UTF-8 so that the file contains the proper glyph instead?

Edit: I'm using Python 3.2

Edit2: So I figured out that urllib.parse.unquote was working correctly, and my problem is actually that I'm serializing to YAML with yaml.dump, and that seems to screw things up. Why?

Upvotes: 0

Answers (4)

jfs

Reputation: 414215

Update: If the output file is a YAML document, you can ignore the \u0163 in it. Unicode escapes are valid in YAML documents.

#!/usr/bin/env python3
import json

# json produces a subset of yaml
print(json.dumps('pe toţi mai')) # -> "pe to\u0163i mai"
print(json.dumps('pe toţi mai', ensure_ascii=False)) # -> "pe toţi mai"

Note: no \u in the last case. Both lines represent the same Python string.
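
One way to check that claim is to parse both serialized forms back and compare the results. This is only a small sketch using the standard json module; the variable names are made up for illustration.

```python
import json

text = "pe to\u0163i mai"  # the same string as 'pe toţi mai'

escaped = json.dumps(text)                      # ASCII-only output with a \u escape
literal = json.dumps(text, ensure_ascii=False)  # output keeps the glyph itself

# The serialized forms differ, but parsing either one gives back the original string.
same_after_parsing = json.loads(escaped) == json.loads(literal) == text
print(escaped != literal, same_after_parsing)
```

So the escape is just one spelling of the character inside the serialized document, and it disappears again when the document is loaded.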

yaml.dump() has a similar option: allow_unicode. Set it to True to avoid Unicode escapes.

The URL is correct. You don't need to do anything with it:

#!/usr/bin/env python3
from urllib.parse import unquote

url = "pe%20to%C5%A3i%20mai"
text = unquote(url)

with open('some_file', 'w', encoding='utf-8') as file:
    def p(line):
        print(line, file=file) # write line to file

    p(text)                # -> pe toţi mai
    p(repr(text))          # -> 'pe toţi mai'
    p(ascii(text))         # -> 'pe to\u0163i mai'

    p("pe to\u0163i mai")  # -> pe toţi mai
    p(r"pe to\u0163i mai") # -> pe to\u0163i mai
    #NOTE: r'' prefix

The \u0163 sequence might be introduced by a character-encoding error handler:

with open('some_other_file', 'wb') as file: # write bytes
    file.write(text.encode('ascii', 'backslashreplace')) # -> pe to\u0163i mai

Or:

with open('another', 'w', encoding='ascii', errors='backslashreplace') as file:
    file.write(text) # -> pe to\u0163i mai

More examples:

# introduce some more \u escapes
b = r"pe to\u0163i mai ţţţ".encode('ascii', 'backslashreplace') # bytes
print(b.decode('ascii')) # -> pe to\u0163i mai \u0163\u0163\u0163
# remove unicode escapes
print(b.decode('unicode-escape')) # -> pe toţi mai ţţţ

Upvotes: 4

rolisz

Reputation: 11832

urllib.parse.unquote returned a correct Unicode string, and writing it straight to the file gave the expected result. The problem was with yaml: by default it doesn't encode with UTF-8.

My solution was to do:

yaml.dump("pe%20to%C5%A3i%20mai",encoding="utf-8").decode("unicode-escape")

Thanks to J.F. Sebastian and Mark Byers for asking me the right questions that helped me figure out the problem!

Upvotes: 1

Mark Byers

Reputation: 838206

Python 3

Calling urllib.parse.unquote returns a Unicode string already:

>>> urllib.parse.unquote("pe%20to%C5%A3i%20mai")
'pe toţi mai'

If you don't get that result, it must be an error in your code. Please post your code.

Python 2

Use decode to get a Unicode string from a bytestring:

>>> import urllib2
>>> print urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')
pe toţi mai

Remember that when you write a Unicode string to a file you have to encode it again. You could choose to write to the file as UTF-8, but you could also choose a different encoding if you wished. You also have to remember to use the same encoding when reading back from the file. You may find the codecs module useful for specifying an encoding when reading from and writing to files.

>>> import urllib2, codecs
>>> s = urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')

>>> # Write the string to a file.
>>> with codecs.open('test.txt', 'w', 'utf-8') as f:
...     f.write(s)

>>> # Read the string back from the file.
>>> with codecs.open('test.txt', 'r', 'utf-8') as f:
...     s2 = f.read()
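
For Python 3, roughly the same round trip can be done with the built-in open(), which takes an encoding argument directly. The sketch below assumes a throwaway file in a temporary directory (the path is hypothetical) and also reads the raw bytes back to show what ended up on disk.

```python
import os
import tempfile
from urllib.parse import unquote

# unquote() decodes percent-escapes as UTF-8 by default; spelled out here for clarity.
text = unquote("pe%20to%C5%A3i%20mai", encoding="utf-8")

# Hypothetical file location, chosen only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Write the string to a file, encoding it as UTF-8.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes to see how the character was stored.
with open(path, "rb") as f:
    raw = f.read()

# Read the string back, decoding with the same encoding.
with open(path, "r", encoding="utf-8") as f:
    text2 = f.read()

print(raw)            # the glyph is stored as the two bytes \xc5\xa3
print(text2 == text)
```

The two bytes on disk are the same %C5%A3 pair that appeared in the URL, so nothing is lost as long as writing and reading use the same encoding.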

One potentially confusing issue is that in the interactive interpreter Unicode strings are sometimes displayed using the \uxxxx notation instead of the actual characters:

>>> s
u'pe to\u0163i mai'
>>> print s
pe toţi mai

This does not mean that the string is "wrong". It's just the way the interpreter works.

Upvotes: 1

Maria Zverina

Reputation: 11173

Try decoding with unicode_escape (the example uses Python 2 syntax).

E.g.:

>>> print "pe to\u0163i mai".decode('unicode_escape')
pe toţi mai
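
In Python 3, str has no decode method, so a possible equivalent goes through bytes first. This is a sketch that assumes the input is pure ASCII containing literal \uXXXX sequences; the variable names are illustrative.

```python
escaped = r"pe to\u0163i mai"  # literal backslash, 'u', and four hex digits

# Encoding as ASCII fails loudly if the input already contains non-ASCII text,
# which unicode_escape would otherwise misread as Latin-1.
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)
```

This only helps when the escapes are really present in the data. If the string already holds the actual character, no decoding step is needed.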

Upvotes: 1
