Alejandro Veintimilla

Reputation: 11523

Python how to solve Unicode Error in string

I'm getting the classic error:

'ascii' codec can't decode byte 0xc3 in position 28: ordinal not in range(128)

This time, I can't solve it. The error comes from this line:

mensaje_texto_inmobiliaria = "%s, con el email %s y el teléfono %s está se ha contactado con Inmobiliar" % (nombre, email, telefono)

Specifically, from the teléfono word. I have tried adding # -*- coding: utf-8 -*- to the beginning of the file, adding unicode( <string> ) and also <string>.encode("utf-8"). Nothing worked. Any advice will help.

Upvotes: 0

Views: 6091

Answers (1)

Thtu

Reputation: 2032

This explains why the following imports solve the issue OP is having, and gives some background on the issue OP is trying to describe:

from __future__ import unicode_literals
from builtins import str

In the default IPython kernel for Python 2.7:

(IPython session)

In [1]: type("é") # By default, quotes in py2 create py2 strings, which is the same thing as a sequence of bytes that given some encoding, can be decoded to a character in that encoding.
Out[1]: str

In [2]: type("é".decode("utf-8")) # We can get to the actual text data by decoding it if we know what encoding it was initially encoded in, utf-8 is a safe guess in almost every country but Myanmar.
Out[2]: unicode

In [3]: len("é") # Note that the py2 `str` representation has a length of 2. In UTF-8, é is encoded as the two bytes 0xC3 0xA9; neither byte is an "e" or an accent on its own.
Out[3]: 2

In [4]: len("é".decode("utf-8")) # the py2 `unicode` representation has length 1, since an accented e is a single character
Out[4]: 1
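
For comparison, here is a small sketch of the same measurements with the types written out explicitly, so it does not depend on the terminal's encoding. The variable names are mine, not OP's. The first byte, 0xc3, is the one named in the error message from the question.

```python
# -*- coding: utf-8 -*-
# Illustrative names (not from the question): `text` is text, `data` is bytes.
text = u"é"                    # one character (code point U+00E9)
data = text.encode("utf-8")    # its UTF-8 encoding: two bytes

print(len(text))               # 1
print(len(data))               # 2
print(data == b"\xc3\xa9")     # True: 0xc3 is the byte from the error message
print(data.decode("utf-8") == text)  # True: decoding recovers the text
```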

Some other things of note in Python 2.7:

  • "é" is the same thing as str("é")
  • u"é" is the same thing as "é".decode('utf-8') or unicode("é", 'utf-8'), assuming the source is UTF-8 encoded
  • u"é".encode('utf-8') is the same thing as str("é")
  • You typically call decode on a py2 str, and encode on a py2 unicode.
    • Due to early design issues, you can call both on either, even though that doesn't really make any sense.
    • In Python 3, str, which is the same as Python 2 unicode, can no longer be decoded, since a string is by definition an already-decoded sequence of bytes. Its encode method uses the utf-8 encoding by default.
  • Byte sequences that were encoded with the ascii codec behave almost exactly the same as their decoded counterparts.
    • In Python 2.7 with no future imports: type("a".decode('ascii')) gives a unicode object, but this behaves nearly identically to str("a"). This is not the case in Python 3.
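
When Python 2 mixes a `str` with a `unicode` object, it decodes the `str` with the ascii codec behind the scenes, which is where OP's error comes from. As a rough sketch, the same failure can be triggered explicitly; the sample bytes here are an assumption, chosen to stand in for the word "teléfono":

```python
# -*- coding: utf-8 -*-
# Assumed sample data: the UTF-8 bytes for "teléfono".
data = b"tel\xc3\xa9fono"

try:
    data.decode("ascii")       # ascii only covers bytes 0-127
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Pure-ASCII bytes decode without trouble under either codec.
print(b"a".decode("ascii") == b"a".decode("utf-8"))
```

Decoding the same bytes with "utf-8" instead of "ascii" succeeds, which is why it matters to name the encoding explicitly.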

With that said, here's what the snippets above do:

  • __future__ is a module maintained by the core Python team that backports Python 3 functionality to Python 2, so you can use Python 3 idioms within Python 2.
  • from __future__ import unicode_literals has the following effect:
    • Without the future import, "é" is the same thing as str("é")
    • With the future import, "é" is functionally the same thing as u"é"
  • builtins is a module that contains safe aliases for using Python 3 idioms in Python 2 with the Python 3 API. On Python 2 it is provided by a third-party package.
    • Due to reasons beyond me, the package itself is named "future", so to install the builtins module you run: pip install future
  • from builtins import str has the following effect:
    • The str constructor now gives what you think it does, i.e. text data in the form of Python 2 unicode objects. So it's functionally the same thing as str = unicode
    • Note: Python 3 str is functionally the same as Python 2 unicode
    • Note: To get bytes, use the b prefix, e.g. b'abc'. In Python 3 a bytes literal may only contain ASCII characters, so for é write b'\xc3\xa9' or "é".encode('utf-8').
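
As a rough sketch of how this applies to the line in the question, the version below makes sure every piece is text before formatting. The helper `to_text` and the sample values are hypothetical, not OP's code. The `u` prefix is written out explicitly here, which is what `unicode_literals` would otherwise do for every literal in the file.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Assumed sample values; the name arrives as UTF-8 bytes, as it might from a form.
nombre = to_text(b"Jos\xc3\xa9")
email = to_text(u"jose@example.com")
telefono = to_text(u"555-0100")

# Text template combined with text arguments: no implicit ascii decode is needed.
mensaje_texto_inmobiliaria = (
    u"%s, con el email %s y el teléfono %s está se ha contactado con Inmobiliar"
    % (nombre, email, telefono)
)

print(mensaje_texto_inmobiliaria)
```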

The takeaway is this :

  1. Decode early (as soon as you read the data) and encode late (only when you write it out)
  2. In Python 2, use str objects for bytes and unicode objects for text
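
A minimal sketch of that pattern, assuming the outside world speaks UTF-8. The `io.BytesIO` buffer is only a stand-in for a real file or socket, and the variable names are illustrative:

```python
# -*- coding: utf-8 -*-
import io

incoming = b"tel\xc3\xa9fono"         # bytes as they arrive from outside

text = incoming.decode("utf-8")       # 1. decode once, at the boundary
shout = text.upper()                  # 2. work only with text inside the program

sink = io.BytesIO()                   # stand-in for a file or socket
sink.write(shout.encode("utf-8"))     # 3. encode once, on the way out

print(sink.getvalue() == b"TEL\xc3\x89FONO")  # True
```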

Upvotes: 3
