Bin Chen

Reputation: 63309

How to convert a string to utf-8 in Python

I have a browser which sends UTF-8 characters to my Python server, but when I retrieve them from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to UTF-8?

NOTE: The string passed from the web is already UTF-8 encoded; I just want to make Python treat it as UTF-8, not ASCII.

Upvotes: 247

Views: 1045362

Answers (13)

George Fonseca

Reputation: 61

The URL is percent-encoded into ASCII, so the Python server sees only a plain string, e.g. "T%C3%A9st%C3%A3o".

The "é" and "ã" arrive as the escape sequences %C3%A9 and %C3%A3, which are the percent-encoded UTF-8 bytes of those characters.

You can decode such a URL like this (Python 3):

import urllib.parse
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))
>> Téstão

See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.
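
For a whole query string rather than a single value, a minimal Python 3 sketch could look like the following. The query string and the parameter names `name` and `lang` are made up for illustration; a real framework will usually hand you the raw query string in its own way.

```python
from urllib.parse import parse_qs, unquote

# Percent-decoding a single value: %C3%A9 is the UTF-8 byte pair for "é".
decoded = unquote("T%C3%A9st%C3%A3o", encoding="utf-8")
print(decoded)

# parse_qs splits a full query string and percent-decodes every value.
# "name" and "lang" are hypothetical parameters.
params = parse_qs("name=T%C3%A9st%C3%A3o&lang=pt", encoding="utf-8")
print(params["name"][0])
```

Both functions default to UTF-8, so the `encoding` argument is spelled out here only to make the assumption visible.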

Upvotes: 6

haccks

Reputation: 106012

You can use the codecs module from Python's standard library.

import codecs
codecs.decode(b'Decode me', 'utf-8')

Upvotes: 1

Kevin

Reputation: 1

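You can also do this with the third-party unidecode package. Note that unidecode transliterates Unicode text to plain ASCII (for example, "é" becomes "e"); it does not decode bytes as UTF-8: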

from unidecode import unidecode
unidecode(yourStringtoDecode)

Upvotes: 0

Blueswannabe

Reputation: 251

Might be a bit overkill, but when I work with ASCII and Unicode in the same files, repeating decode can be a pain, so this is what I use (Python 2):

def make_unicode(inp):
    # Decode only byte strings; unicode objects pass through unchanged.
    if not isinstance(inp, unicode):
        inp = inp.decode('utf-8')
    return inp
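
The same idea can be sketched for Python 3, where the two types are `bytes` and `str`. The name `make_text` and the `encoding` parameter are illustrative and not part of the original answer.

```python
def make_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    # Already text: hand it back unchanged.
    return value


print(make_text(b"caf\xc3\xa9"))
print(make_text("café"))
```

Using `isinstance` instead of comparing types also lets subclasses of `bytes` through.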

Upvotes: 25

shioko

Reputation: 312

  • First, str in Python is represented in Unicode.
  • Second, UTF-8 is an encoding standard for turning a Unicode string into bytes. There are many encoding standards out there (e.g. UTF-16, ASCII, SHIFT-JIS, etc.).

When the client sends data to your server using UTF-8, it is sending a bunch of bytes, not a str.

You received a str because the "library" or "framework" that you are using has implicitly converted those bytes to str.

Under the hood, there is just a bunch of bytes. You just need to ask the "library" to give you the request content as bytes, and then handle the decoding yourself (if the library can't give you the raw bytes, it is doing black magic and you shouldn't use it).

  • Decode UTF-8 encoded bytes to str: bs.decode('utf-8')
  • Encode str to UTF-8 bytes: s.encode('utf-8')
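
As a rough Python 3 illustration of the two operations above, the sketch below round-trips a sample string and shows what happens when the bytes are decoded with the wrong codec. The sample text is arbitrary.

```python
text = "Téstão"

# str -> bytes: each non-ASCII character becomes two bytes in UTF-8.
raw = text.encode("utf-8")
print(raw)                   # b'T\xc3\xa9st\xc3\xa3o'
print(len(text), len(raw))   # 6 8

# bytes -> str: decoding with the same codec restores the original text.
round_tripped = raw.decode("utf-8")
print(round_tripped == text)  # True

# Decoding with a different codec "works" but yields mojibake.
wrong = raw.decode("latin-1")
print(wrong)                 # TÃ©stÃ£o
```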

Upvotes: 4

user225312

Reputation: 131647

In Python 2

>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)

^ This is the difference between a byte string (plain_string) and a unicode string.

>>> s = "Hello!"
>>> u = unicode(s, "utf-8")

^ Converting to unicode and specifying the encoding.

In Python 3

All strings are unicode. The unicode function does not exist anymore. See answer from @Noumenon
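
For reference, a minimal sketch of the closest Python 3 counterpart to `unicode(s, "utf-8")`, assuming the data you start with is a `bytes` object:

```python
raw = b"Hi!"

# str(bytes_obj, encoding) decodes, just like bytes_obj.decode(encoding).
text = str(raw, "utf-8")
same = raw.decode("utf-8")

print(type(raw), type(text))  # <class 'bytes'> <class 'str'>
print(text == same)           # True
```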

Upvotes: 315

David-Star

Reputation: 35

Yes, you can add

# -*- coding: utf-8 -*-

as the first or second line of your source file.

You can read more details here https://www.python.org/dev/peps/pep-0263/

Upvotes: -1

Zld Productions

Reputation: 349

Python 3 has no built-in unicode() function. Strings are already Unicode by default, so no conversion is required. Example:

my_str = "\u221a25"
print(my_str)
# prints: √25

Upvotes: 13

Joe9008

Reputation: 654

Translate with ord() and unichr(). Every Unicode character has a number associated with it (its code point), something like an index, and Python has functions to translate between a character and its number. Below is an example with ñ in Python 2. Hope it helps.

>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
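
In Python 3, `unichr()` is gone and `chr()` covers the full Unicode range. A small sketch of the same session under that assumption:

```python
ch = "ñ"

code_point = ord(ch)      # character -> code point
print(code_point)         # 241
print(hex(code_point))    # 0xf1

restored = chr(241)       # code point -> character
print(restored)           # ñ

utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)         # b'\xc3\xb1'
```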

Upvotes: 5

Willem

Reputation: 1334

city = 'Ribeir\xc3\xa3o Preto'
print city.decode('cp1252').encode('utf-8')
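
A related case in Python 3, offered as an assumption about the kind of problem this answer targets: text whose UTF-8 bytes were mistakenly decoded as cp1252 shows up as mojibake, and it can often be repaired by reversing that step. The variable names below are illustrative.

```python
# UTF-8 bytes for "Ribeirão Preto" that were wrongly decoded as cp1252.
garbled = "RibeirÃ£o Preto"

# Undo the wrong decode, then decode the recovered bytes as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

This only works when every garbled character maps back to a cp1252 byte; otherwise `encode` raises `UnicodeEncodeError`.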

Upvotes: 15

Ken

Reputation: 369

Adding the following line to the top of your .py file:

# -*- coding: utf-8 -*-

allows you to write non-ASCII string literals directly in your script, like this:

utfstr = "ボールト"

Upvotes: 16

duhaime

Reputation: 27594

If the methods above don't work, you can also tell Python to ignore the portions of a byte string that it can't decode as UTF-8:

stringnamehere.decode('utf-8', 'ignore')
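
To see what the error handler actually does, here is a short Python 3 sketch comparing `'ignore'` with `'replace'` and the default `'strict'`. The sample bytes are invented, with `\xff` standing in for a byte that is never valid in UTF-8.

```python
raw = b"caf\xc3\xa9 \xff ok"  # \xff is not valid UTF-8

# 'ignore' silently drops undecodable bytes (data loss).
ignored = raw.decode("utf-8", "ignore")
print(repr(ignored))

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("utf-8", "replace")
print(repr(replaced))

# The default, 'strict', raises instead.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```

Since `'ignore'` discards data without any trace, `'replace'` is often the safer choice when you need to notice bad input later.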

Upvotes: 85

codeape

Reputation: 100766

If I understand you correctly, you have a utf-8 encoded byte-string in your code.

Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).

You do that by using the unicode function or the decode method. Either:

unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")

Or:

unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")

Upvotes: 13
