confetti

Reputation: 1100

Proper use of unicode characters in python3 - Force utf-8 encoding

I'm going crazy here. The internet and this SO question tell me that in Python 3.x the default encoding is UTF-8. On top of that, my system's default encoding is UTF-8, and I have # -*- coding: utf-8 -*- at the top of my Python 3.5 file.

Still, Python is using ASCII:

# -*- coding: utf-8 -*-
mystring = "Ⓐ"
print(mystring)

Greets me with:

SyntaxError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)

I've also tried print(mystring.encode("utf-8")) and .decode("utf-8"), with the same result.

What am I missing here? How do I force python to stop using ascii encoding?


Edit: I know that it seems weird to complain about position 7 with a one character string, but this is my actual MCVE and the exact output I'm getting. The above is using python shell, the below is in a script. Both use python 3.5.2.


Edit: Since I figured it might be relevant: the string I'm getting comes from an external application and is not hardcoded, so I need a way to take that UTF-8 string and save it to a file. The above is just a minimal, generalized example. Here is my real-life code:

# the variables being a string that might contain unicode characters
mystring = "username: " + fromuser + " | printname: " + fromname
with open("myfile.txt", "a") as myfile:
  myfile.write(mystring + "\n")

Upvotes: 1

Views: 6332

Answers (2)

sehafoc

Reputation: 876

In Python 3 all strings are Unicode, so the problem you're having is most likely that your locale settings are not correct. The Python 3 interpreter reads the locale environment variables, and if it cannot find them it falls back to basic ASCII.

From locale.py:

except ImportError:

    # Locale emulation

    CHAR_MAX = 127
    LC_ALL = 6
    LC_COLLATE = 3
    LC_CTYPE = 0
    LC_MESSAGES = 5
    LC_MONETARY = 4
    LC_NUMERIC = 1
    LC_TIME = 2
    Error = ValueError
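
To see which encodings the interpreter actually picked up from your environment, something like the following may help. It is only a diagnostic sketch; the helper name describe_encodings is made up for illustration and is not part of any library.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python picked up from the environment."""
    return {
        # Encoding used for str <-> bytes conversions when none is given;
        # this is 'utf-8' on Python 3 regardless of the locale.
        "default": sys.getdefaultencoding(),
        # Encoding derived from the locale (LANG / LC_ALL / LC_CTYPE);
        # open() falls back to this when no encoding= is passed.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of the stream that print() writes to.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print("{}: {}".format(name, value))
```

If "preferred" or "stdout" report an ASCII-like encoding, that points at the locale rather than at your source file.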

Double-check the locale in the shell you are executing from. Here are a few workarounds you can try to see if they get you working before you go through the task of setting up your environment correctly.

1) Validate UTF-8 locale or language files are installed (see link above)

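2) Try adding this to the top of your script (how a shebang line is split into arguments is platform-specific; Linux passes everything after the interpreter path as a single argument, so this form may not work there)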

#!/usr/bin/env LC_ALL=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')

or

#!/usr/bin/env LANG=en_US.UTF-8 /usr/local/bin/python3
print('カタカナ')

Or export shell variables before executing the Python interpreter

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
python3
>>> print('カタカナ')

Sorry I cannot be more specific, as these settings are platform and OS specific. You can attempt to force the locale from within Python using the locale module, but I don't recommend that, and it won't help if the locale files are not installed.
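
For the file-writing part of the question, one option that does not depend on the locale at all is to pass the encoding to open() explicitly. A minimal sketch, assuming the values arrive as ordinary str objects; append_line and the sample values are hypothetical stand-ins, not the asker's actual code:

```python
import os
import tempfile


def append_line(path, text):
    """Append one line to path, always encoding it as UTF-8."""
    # encoding="utf-8" overrides the locale-derived default, so the write
    # does not depend on LANG / LC_ALL being set correctly.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text + "\n")


if __name__ == "__main__":
    # Hypothetical stand-ins for the values from the external application.
    fromuser = "confetti"
    fromname = "\u24b6"  # circled A, written as an escape to keep this file ASCII-only

    # A temporary directory is used only so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "myfile.txt")
        append_line(target, "username: " + fromuser + " | printname: " + fromname)
        # Read the raw bytes back to inspect what was written.
        with open(target, "rb") as handle:
            print(handle.read())
```

This only covers file output; print() still goes through sys.stdout, whose encoding comes from the environment.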

Hope that helps.

Upvotes: 5

J. Blackadar

Reputation: 1961

What's new in Python 3.0 says:

All text is Unicode; however encoded Unicode is represented as binary data

If you want to try decoding UTF-8 bytes into text, here's an example:

b'\x41'.decode("utf-8", "strict")

If you'd like to use a Unicode character in a string literal without putting non-ASCII bytes in the source, use a Unicode escape with the character's code point. For your example:

print("\u24B6")
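
To make the text versus bytes distinction concrete, here is a small sketch of the round trip for that character. The variable names and the ascii_decode_fails helper are illustrative only. Note that the first byte of the UTF-8 form is 0xe2, the same byte the error message in the question complains about.

```python
# One code point: U+24B6, CIRCLED LATIN CAPITAL LETTER A.
circled_a = "\u24B6"

# Text -> bytes: UTF-8 needs three bytes for this code point.
encoded = circled_a.encode("utf-8")

# Bytes -> text: decoding reverses the step above.
decoded = encoded.decode("utf-8", "strict")


def ascii_decode_fails(data):
    """Return True if the bytes cannot be decoded with the ascii codec."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(encoded)                      # bytes repr, safe on any terminal
    print(decoded == circled_a)         # round trip preserves the text
    print(ascii_decode_fails(encoded))  # ascii cannot represent byte 0xe2
```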

Upvotes: 0
