Reputation: 1038
I am reading a UTF-8 file into a unicode string, and I do not get any errors.
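import codecs
try:
    f = codecs.open(fil_name, "r", "utf-8")
    f_str = f.read()  # decoded text: a unicode object in Python 2
except IOError:
    raise  # the original post omits its error handling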
That is, the string f_str is "unicode". Later in the program I have to send the (u) string in f_str to a socket, so I am trying to convert the string back to "utf-8".
usock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
usock.connect(("xxx server", 123))
usock.send("TEXT %s\nENDQ\n" % f_str.replace("\n", " ").encode("utf-8"))
Here I get an error message:
usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
My text contains characters that cannot be encoded in pure ASCII (äö..), but that is not a problem for utf-8 or latin-1. Why am I getting this error? I am not using ASCII, I am using unicode/utf-8.
Upvotes: 0
Views: 2053
Reputation: 879083
The error occurs on this line:
usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))
I can reproduce a similar error this way:
In [23]: text = 'äö'
In [24]: 'TEXT %s' % text.replace("\n", " ").encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Although you've shown that f_str is unicode, somehow, text is a str object. Some extra processing you are doing between f_str and text is probably making text a str.
If you can convert all input to unicode, work with them as unicode, and only convert back to a specific encoding upon output (as needed), your problem should be fixed.
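A minimal sketch of that approach, written so it runs on both Python 2.7 and Python 3. The helper name to_text is hypothetical, and it assumes that any byte string arriving from elsewhere is UTF-8 encoded, which the question does not confirm.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) object.

    Hypothetical helper: byte strings are decoded with `encoding`
    (assumed to be UTF-8 here); text objects pass through unchanged.
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value


# Input side: decode as early as possible.
raw = u"\u00e4\u00f6\n".encode("utf-8")  # stands in for bytes from another source
text = to_text(raw)

# Middle: work only with text objects.
line = text.replace(u"\n", u" ")

# Output side: encode once, right before the data leaves the program.
payload = line.encode("utf-8")
```

Passing every incoming string through one helper like this means the rest of the program never has to guess whether it holds a str or a unicode object.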
Upvotes: 0
Reputation: 43024
Your string literal is a byte string. When you interpolate a unicode object into it, Python 2 implicitly decodes the byte strings involved using the default encoding (ascii) to produce a unicode result, and that fails on any non-ASCII byte.
There are a couple of ways to fix this. One is to just use Python 3. ;-)
If you are using Python 2 then put the following at the top of the source file:
from __future__ import unicode_literals
Then your literal will be unicode also.
You could also prefix the string with a 'u'.
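Another problem with that line is precedence. The method call binds more tightly than %, so .encode("utf-8") applies only to the right-hand operand, not to the formatted message. Any implicit conversion with the ascii codec happens separately from that explicit encode. It is safer to format the whole message as unicode first and encode it once: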
(u"TEXT %s\nENDQ\n" % f_str.replace(u"\n", u" ")).encode("utf-8")
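The same idea can be wrapped in a small function. This is only a sketch: send_text is a hypothetical name, and it uses sendall() instead of the send() call from the question, because send() may transmit only part of the buffer.

```python
def send_text(sock, f_str):
    """Format the request as text, encode it once, and send the bytes.

    Hypothetical helper; `sock` is any connected object with sendall(),
    and `f_str` is expected to be a text (unicode) object.
    """
    message = u"TEXT %s\nENDQ\n" % f_str.replace(u"\n", u" ")
    payload = message.encode("utf-8")  # the only place bytes are produced
    sock.sendall(payload)              # keeps sending until all bytes are written
    return payload
```

With the socket from the question this would be called as send_text(usock, f_str).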
Upvotes: 1
Reputation: 12068
Begin with the obvious Python unicode checklist. Put
# -*- coding: utf-8 -*-
at the top of every source file.
Also, why do you need to encode('utf-8') if it is already unicode? What error message do you get if you don't do that?
Did you try explicitly converting f_str to unicode, like this:
f_str=unicode(f_str)
Also try printing f_str and check that you are getting the right result before sending it. Maybe this is a problem with the data.
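One way to do that check is to look at the type and the repr rather than the printed characters, since the repr shows the underlying bytes or code points. A small sketch (describe is a hypothetical helper, not part of the question's code):

```python
def describe(value):
    """Return the kind, length, and repr of a string for debugging."""
    # `bytes` is an alias of `str` on Python 2, so this separates
    # byte strings from text (unicode) objects on both versions.
    kind = "bytes" if isinstance(value, bytes) else "text"
    return u"%s, length %d: %r" % (kind, len(value), value)
```

Calling describe(f_str) and describe(text) just before the send would show whether either one has silently become a byte string.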
Upvotes: 0