andreSmol

Reputation: 1038

Python 2 socket and string coding

I am reading a utf-8 file into a unicode string, and I do not get any errors.

import codecs

with codecs.open(fil_name, "r", "utf-8") as f:
    f_str = f.read()  # f_str is a unicode object

That is, the string f_str is "unicode". Later in the program I have to send the (unicode) string in f_str to a socket, so I am trying to convert the string back to "utf-8".

usock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
usock.connect(("xxx server", 123))
usock.send("TEXT %s\nENDQ\n" % f_str.replace("\n", " ").encode("utf-8"))

Here I get an error message:

usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)

My text contains characters that cannot be encoded in pure ASCII (äö..), but that is not a problem with utf-8 or latin-1. Why am I getting this error? I am not using ASCII, I am using unicode/utf-8.

Upvotes: 0

Views: 2053

Answers (3)

unutbu

Reputation: 879083

The error occurs on this line:

usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))

I can reproduce a similar error this way:

In [23]: text = 'äö'

In [24]: 'TEXT %s' % text.replace("\n", " ").encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Although you've shown that f_str is unicode, text is somehow a str object. Calling .encode('utf-8') on a str makes Python 2 first decode it implicitly with the ascii codec, which is what fails. Some extra processing you are doing between f_str and text is probably making text a str.

If you convert all input to unicode, work with it as unicode, and only convert back to a specific encoding upon output (as needed), your problem should be fixed.
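As a minimal sketch of that advice: decode at the boundary, format in unicode, and encode once on output. The name build_message is a hypothetical helper, not something from the question, and utf-8 input is an assumption.

```python
def build_message(text, encoding="utf-8"):
    """Return the bytes to send for one TEXT request (hypothetical helper)."""
    # Decode at the boundary: if a byte string sneaks in, turn it into
    # unicode explicitly instead of letting Python guess with ascii.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # Format entirely in unicode (note the u prefixes on every literal).
    message = u"TEXT %s\nENDQ\n" % text.replace(u"\n", u" ")
    # Encode exactly once, right before output.
    return message.encode(encoding)
```

With this, build_message(f_str) and build_message(text) give the same bytes whether the argument is unicode or an already-encoded utf-8 str.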

Upvotes: 0

Keith

Reputation: 43024

Your string literal is a byte string. When you try to interpolate into it, Python 2 implicitly converts between byte strings and unicode using the default encoding (ascii).

There are a couple of ways to fix this. One is just use Python 3. ;-)

If you are using Python 2 then put the following at the top of the source file:

from __future__ import unicode_literals

Then your literal will be unicode also.

You could also prefix the string with a 'u'.

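Another problem with that line is precedence. The method call .encode("utf-8") binds more tightly than %, so only the result of replace() is encoded, and the '%s' format operation runs afterwards on the byte-string literal. Formatting the whole message as unicode first and encoding once at the end avoids any implicit conversion with the ascii codec.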

So, try this:

(u"TEXT %s\nENDQ\n" % f_str.replace(u"\n", u" ")).encode("utf-8")
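To show that expression end to end, here is a rough sketch that sends the encoded message over a socket. Two things here are my assumptions rather than part of the question: send_text is a hypothetical helper name, and socket.socketpair() stands in for the real server so the example runs locally. I also use sendall() instead of send(), because send() may write only part of the buffer.

```python
import socket


def send_text(sock, text, encoding="utf-8"):
    """Format the request as unicode, encode once, and send every byte."""
    payload = (u"TEXT %s\nENDQ\n" % text.replace(u"\n", u" ")).encode(encoding)
    sock.sendall(payload)  # keeps writing until the whole payload is sent
    return payload


# Local demonstration: a connected socket pair replaces the real server.
left, right = socket.socketpair()
try:
    sent = send_text(left, u"\xe4\xf6\nline two")
    received = right.recv(1024)
finally:
    left.close()
    right.close()
```

In the real program you would pass usock and f_str to the helper in place of the socket pair and the sample text.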

Upvotes: 1

alonisser

Reputation: 12068

Begin by going through the obvious Python unicode checklist:

  1. putting # -*- coding: utf-8 -*- at the top of every source file
  2. checking that the text file really is encoded as utf-8 (the default is often something else, e.g. ascii or code page 1255)

Also:

Why do you need to encode('utf-8') if it is already unicode? What error message do you get if you don't do that?

And did you try to explicitly convert f_str to unicode, like this:

f_str = unicode(f_str)

Also try printing f_str and checking that you get the right result beforehand. Maybe this is a problem with the data.
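For that check, printing the repr together with the kind of string is usually more telling than a plain print. A small sketch (describe is a hypothetical helper name, not from the question):

```python
def describe(value):
    """Report whether value is a byte string or unicode text, plus its repr."""
    # In Python 2, bytes is just another name for str.
    kind = "byte string" if isinstance(value, bytes) else "unicode text"
    return "%s: %r" % (kind, value)
```

Calling print(describe(f_str)) and print(describe(text)) right before the send should show whether one of them has turned into a byte string along the way.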

Upvotes: 0
