Reputation: 11832
I have a string like "pe%20to%C5%A3i%20mai". When I apply urllib.parse.unquote to it, I get "pe to\u0163i mai". If I try to write this to a file, I get those exact symbols, not the expected glyph.
How can I convert the string to UTF-8 so that the file contains the proper glyph instead?
Edit: I'm using Python 3.2
Edit2: So I figured out that urllib.parse.unquote was working correctly, and my problem is actually that I'm serializing to YAML with yaml.dump, and that seems to screw things up. Why?
Upvotes: 0
Views: 4894
Reputation: 414215
Update: If the output file is a YAML document, then you can ignore the \u0163 in it. Unicode escapes are valid in YAML documents.
#!/usr/bin/env python3
import json
# json produces a subset of yaml
print(json.dumps('pe toţi mai')) # -> "pe to\u0163i mai"
print(json.dumps('pe toţi mai', ensure_ascii=False)) # -> "pe toţi mai"
Note: there is no \u in the last case. Both lines represent the same Python string.
yaml.dump() has a similar option: allow_unicode. Set it to True to avoid Unicode escapes.
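As a minimal sketch of that option, assuming the third-party PyYAML package is installed and importable as yaml (the dictionary key 'text' is made up for illustration), the two calls below dump the same string with and without escapes. Both documents should load back to the same value:

```python
import yaml  # third-party PyYAML package (assumed to be installed)

# Hypothetical sample data; '\u0163' is the character "t with cedilla".
data = {'text': 'pe to\u0163i mai'}

# Default settings: the non-ASCII character is written as a \uXXXX escape.
escaped = yaml.dump(data)

# allow_unicode=True: the character is written as-is.
readable = yaml.dump(data, allow_unicode=True)

# Either form parses back to the original data.
assert yaml.safe_load(escaped) == yaml.safe_load(readable) == data
```

When dumping into a file, open the file with encoding='utf-8' so that the unescaped character can actually be written.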
The url is correct. You don't need to do anything with it:
#!/usr/bin/env python3
from urllib.parse import unquote
url = "pe%20to%C5%A3i%20mai"
text = unquote(url)
with open('some_file', 'w', encoding='utf-8') as file:
    def p(line):
        print(line, file=file)  # write line to file

    p(text)                  # -> pe toţi mai
    p(repr(text))            # -> 'pe toţi mai'
    p(ascii(text))           # -> 'pe to\u0163i mai'
    p("pe to\u0163i mai")    # -> pe toţi mai
    p(r"pe to\u0163i mai")   # -> pe to\u0163i mai
                             # NOTE: r'' prefix
The \u0163 sequence might be introduced by a character encoding error handler:
with open('some_other_file', 'wb') as file:  # write bytes
    file.write(text.encode('ascii', 'backslashreplace'))  # -> pe to\u0163i mai
Or:
with open('another', 'w', encoding='ascii', errors='backslashreplace') as file:
    file.write(text)  # -> pe to\u0163i mai
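For comparison, here is a small sketch of what the other standard error handlers do with the same string when the target encoding is ASCII. The variable names are only illustrative:

```python
text = 'pe to\u0163i mai'  # the unquoted string from the question

# Each handler decides what happens to characters ASCII cannot represent.
handlers = ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace')
results = {name: text.encode('ascii', name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)

# The default handler, 'strict', raises instead of substituting anything.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('strict raised:', exc.reason)
```

Only 'backslashreplace' produces the \u0163 form seen in the question, so finding that sequence in a file is a hint that some layer encoded to ASCII with that handler, or escaped the text deliberately.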
More examples:
# introduce some more \u escapes
b = r"pe to\u0163i mai ţţţ".encode('ascii', 'backslashreplace') # bytes
print(b.decode('ascii')) # -> pe to\u0163i mai \u0163\u0163\u0163
# remove unicode escapes
print(b.decode('unicode-escape')) # -> pe toţi mai ţţţ
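One caveat with the last step, shown as a small sketch: the unicode-escape codec treats every byte that is not part of an escape as Latin-1, so it is only safe on pure-ASCII input such as b above. If the bytes already contain UTF-8 encoded characters, those get mangled (the variable names here are illustrative):

```python
# Pure ASCII bytes containing a literal backslash-u escape.
escaped_only = b'pe to\\u0163i mai'

# The same escape followed by a real U+0163 character encoded as UTF-8.
mixed = 'pe to\\u0163i mai \u0163'.encode('utf-8')

ok = escaped_only.decode('unicode-escape')
mangled = mixed.decode('unicode-escape')

# ascii() keeps the output readable on any terminal.
print(ascii(ok))       # the escape is resolved to U+0163
print(ascii(mangled))  # the UTF-8 bytes C5 A3 are read as two Latin-1 characters
```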
Upvotes: 4
Reputation: 11832
urllib.parse.unquote returned a correct Unicode string, and writing that straight to the file gave the expected result. The problem was with yaml: by default it writes non-ASCII characters as Unicode escapes instead of as UTF-8.
My solution was to do:
yaml.dump("pe%20to%C5%A3i%20mai",encoding="utf-8").decode("unicode-escape")
Thanks to J.F. Sebastian and Mark Byers for asking me the right questions that helped me figure out the problem!
Upvotes: 1
Reputation: 838206
Python 3
Calling urllib.parse.unquote returns a Unicode string already:
>>> import urllib.parse
>>> urllib.parse.unquote("pe%20to%C5%A3i%20mai")
'pe toţi mai'
If you don't get that result, it must be an error in your code. Please post your code.
Python 2
Use decode to get a Unicode string from a bytestring:
>>> import urllib2
>>> print urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')
pe toţi mai
Remember that when you write a Unicode string to a file you have to encode it again. You could choose to write to the file as UTF-8, but you could also choose a different encoding if you wished. You also have to remember to use the same encoding when reading back from the file. You may find the codecs module useful for specifying an encoding when reading from and writing to files.
>>> import urllib2, codecs
>>> s = urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')
>>> # Write the string to a file.
>>> with codecs.open('test.txt', 'w', 'utf-8') as f:
...     f.write(s)
...
>>> # Read the string back from the file.
>>> with codecs.open('test.txt', 'r', 'utf-8') as f:
...     s2 = f.read()
...
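Since the question is about Python 3, here is a rough equivalent of the same round trip there, where the built-in open takes an encoding argument directly and codecs is not needed. The temporary directory is only used to keep the sketch self-contained:

```python
import os
import tempfile
from urllib.parse import unquote

text = unquote("pe%20to%C5%A3i%20mai")  # already a str in Python 3

# Hypothetical file location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Write the string to a file, encoding it as UTF-8.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read it back with the same encoding.
with open(path, 'r', encoding='utf-8') as f:
    text_back = f.read()

# Look at the raw bytes: U+0163 is stored as the two bytes C5 A3.
with open(path, 'rb') as f:
    raw = f.read()

print(text_back == text)
print(raw)
```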
One potentially confusing issue is that in the interactive interpreter Unicode strings are sometimes displayed using the \uxxxx notation instead of the actual characters:
>>> s
u'pe to\u0163i mai'
>>> print s
pe toţi mai
This does not mean that the string is "wrong". It's just the way the interpreter works.
Upvotes: 1
Reputation: 11173
Try decode using unicode_escape. E.g.:
>>> print "pe to\u0163i mai".decode('unicode_escape')
pe toţi mai
Upvotes: 1