Reputation: 43867

Pythonic way to ensure unicode in python 2 and 3

I'm porting a library so that it is compatible with both Python 2 and 3. The library receives strings or string-like objects from the calling application, and I need to ensure those objects are converted to unicode strings.

In Python 2 I can do:

unicode_x = unicode(x)

In Python 3 I can do:

unicode_x = str(x)

However, the best cross-version solution I have is:

import sys

def ensure_unicode(x):
    if sys.version_info < (3, 0):
        return unicode(x)
    return str(x)

This works, but it certainly doesn't seem great. Is there a better solution?

I am aware of unicode_literals and the u prefix, but neither works here: the inputs come from clients and are not literals in my library.

Upvotes: 14

Answers (3)

alick

Reputation: 327

If six.text_type(b'foo') returning "b'foo'" in Python 3 is not what you want (as mentioned in Alex's answer), you probably want six.ensure_text(), available in six v1.12.0+.

In [17]: six.ensure_text(b'foo')
Out[17]: 'foo'

Ref: https://six.readthedocs.io/#six.ensure_text
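For readers who cannot upgrade six, the behaviour described above can be approximated with the standard library alone. This is only a sketch: ensure_text_sketch, text_type and binary_type are hypothetical local names, and the function is a stand-in for six.ensure_text rather than a copy of it.

```python
import sys

# Hypothetical local aliases for the text and byte string types.
if sys.version_info[0] >= 3:
    text_type = str
    binary_type = bytes
else:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)
    binary_type = str


def ensure_text_sketch(value, encoding="utf-8", errors="strict"):
    """Return value as text: decode byte strings, pass text through unchanged."""
    if isinstance(value, binary_type):
        return value.decode(encoding, errors)
    if isinstance(value, text_type):
        return value
    # Assumption: anything that is neither bytes nor text is rejected.
    raise TypeError("not expecting type '%s'" % type(value))
```

Unlike text_type(x), this version refuses integers, None and other non-string input, so it only fits callers that always pass some kind of string.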

Upvotes: 4

Alex Pizarro

Reputation: 71

Using six.text_type should suffice in virtually all cases, just as the accepted answer says.

As a side note, you can get into trouble in Python 3 if you somehow feed a bytes instance to it (although this should be really hard to do).

CONTEXT

six.text_type is basically an alias for str in Python 3:

>>> import six
>>> six.text_type
<class 'str'>

Using str to convert bytes instances gives somewhat unexpected results:

>>> six.text_type(b'bytestring')
"b'bytestring'"

Notice how our string just got mangled? Straight from str's docs:

Passing a bytes object to str() without the encoding or errors arguments falls under the first case of returning the informal string representation.

That is, str(...) will actually call the object's __str__ method, unless you pass an encoding:

>>> b'bytestring'.__str__()
"b'bytestring'"
>>> six.text_type(b'bytestring', encoding='utf-8')
'bytestring'

Sadly, if you do pass an encoding, "casting" regular str instances will no longer work:

>>> six.text_type('string', encoding='utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding str is not supported

On a somewhat related note, casting None values can be troublesome as well:

>>> six.text_type(None)
'None'

You'll end up with a 'None' string, literally. Probably not what you wanted.
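Both pitfalls above (bytes turning into their repr, and None turning into 'None') can be sidestepped with a small wrapper. The following is a minimal sketch using only the standard library; to_text and its encoding parameter are hypothetical names, and defaulting to UTF-8 is an assumption about the input.

```python
import sys

# Same alias that six.text_type provides, defined locally for this sketch.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: convert value to text, avoiding the pitfalls above."""
    if value is None:
        return None  # keep None instead of producing the string 'None'
    if isinstance(value, text_type):
        return value  # already text, nothing to do
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)  # decode rather than taking the repr
    return text_type(value)  # ints, floats, objects with __str__, ...
```

Whether None should pass through, become an empty string, or raise an error depends on the caller; passing it through is just one choice.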

ALTERNATIVES

1. Just use six.text_type. Really. There's nothing to worry about unless you interact with bytes on purpose. Make sure to check for None before casting, though.
2. Use Django's force_text. Safest way out of this madness if you happen to be working on a project that's already using Django 1.x.x.
3. Copy-paste Django's force_text to your project. Here's a sample implementation.

For either of the Django alternatives, keep in mind that force_text allows you to specify strings_only=True to neatly preserve None values:

>>> force_text(None)
'None'
>>> type(force_text(None))
<class 'str'>

>>> force_text(None, strings_only=True)
>>> type(force_text(None, strings_only=True))
<class 'NoneType'>

Be careful, though: with strings_only=True it won't convert several other primitive types either:

>>> force_text(100)
'100'
>>> force_text(100, strings_only=True)
100
>>> force_text(True)
'True'
>>> force_text(True, strings_only=True)
True
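The strings_only behaviour shown above can be imitated roughly as follows. This is a simplified, Python 3 only sketch and not Django's code: force_text_sketch and PROTECTED_TYPES are hypothetical names, and the set of pass-through types is an assumption chosen to reproduce the output above.

```python
import datetime
from decimal import Decimal

# Assumption: types returned unchanged when strings_only=True.
# bool is covered as well because it is a subclass of int.
PROTECTED_TYPES = (type(None), int, float, Decimal,
                   datetime.datetime, datetime.date, datetime.time)


def force_text_sketch(value, encoding="utf-8", strings_only=False):
    """Convert value to str, optionally leaving non-string primitives alone."""
    if strings_only and isinstance(value, PROTECTED_TYPES):
        return value  # passed through untouched
    if isinstance(value, bytes):
        return value.decode(encoding)  # decode instead of str(b'...')
    return str(value)
```

The flag trades one surprise for another: None survives, but so do numbers and dates, so callers can no longer assume the result is a string.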

Upvotes: 4

Martijn Pieters

Reputation: 1123500

Don't re-invent the compatibility layer wheel. Use the six compatibility layer, a small one-file project that can be included with your own:

Six supports every Python version since 2.6. It is contained in only one Python file, so it can be easily copied into your project. (The copyright and license notice must be retained.)

It includes a six.text_type() callable that does exactly this, converting a value to Unicode text:

import six

unicode_x = six.text_type(x)

In the project source code this is defined as:

import sys

PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3
# ...

if PY3:
    # ...
    text_type = str
    # ...

else:
    # ...
    text_type = unicode
    # ...
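If vendoring six is not possible, the alias itself is small enough to define locally. The sketch below relies on the fact that the name unicode only exists on Python 2; text_type here is a local variable and six is not imported.

```python
try:
    text_type = unicode  # noqa: F821 (Python 2: the built-in unicode type)
except NameError:
    text_type = str  # Python 3: str is already Unicode text

# Example conversion of a non-string value.
unicode_x = text_type(42)
```

This covers only text_type; for anything beyond that single alias, six remains the less error-prone choice.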

Upvotes: 23
