Cosinux
Cosinux

Reputation: 321

Use unicode strings instead of regular strings? (Python 2.7)

As far as I know there is a difference between strings and unicode strings in Python. But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?

So when I get a text input, I don't need to use unicode()?

I might sound lazy but I am just interested if this is possible...

p.s. I don't know a lot about character encoding so please correct me if I got anything wrong

Upvotes: 1

Views: 12051

Answers (3)

jfs
jfs

Reputation: 414305

But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?

There are two type of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode codepoints).

bytestring = b'abc'
unicode_text = u'abc'

The type of string created using 'abc' string literal depends on Python version and the presence of from __future__ import unicode_literals import. Without the import on Python 2, 'abc' literal creates a bytestring otherwise it creates a Unicode string.

Add the encoding declaration at the top of your Python source file if you use non-ascii characters in string literals e.g.: # -*- coding: utf-8 -*-.

So when I get a text input, I don't need to use unicode()?

If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).

And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).

Use bytestrings to work with a binary data: an image, a compressed content, etc. Use Unicode strings to work with text in Python.

To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).

What character encoding to use when you receive your Unicode text via network may depend on the corresponding protocol e.g., the charset can be specified in Content-Type http header:

    text = data.decode(response.headers.getparam('charset'))

You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details here: Byte при печати вывода внешней команды).

Upvotes: 3

Lily
Lily

Reputation: 31

For Example(In pyhon interactive,diff in GUI Shell) :

>>> s = '你好'
>>> s
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> us = u'你好'
>>> us
u'\u4f60\u597d'
>>> print type(s)
<type 'str'>
>>> print type(us)
<type 'unicode'>
>>> len(s)
6
>>> len(us)
2

In short:
First, a string object is a sequence of characters,a Unicode string is a sequence of code points(Unicode code units), which are numbers from 0 to 0x10ffff.
Them, len(string) will reture a set of bytes,len(unicode) will return a number of characters.This sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
I think you should use raw_input to instead input, if you want to get bytestring.

Upvotes: 3

Mark Tolonen
Mark Tolonen

Reputation: 177715

In Python 2.6+ you can use from __future__ import unicode_literals, but that only makes string literals Unicode. All functions that returned byte strings still return byte strings.

Example:

>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> from __future__ import unicode_literals
>>> s = 'abc'
>>> type(s)
<type 'unicode'>

For the behavior you want, use Python 3.

Upvotes: 2

Related Questions