Homunculus Reticulli
Homunculus Reticulli

Reputation: 68366

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?

Upvotes: 1512

Views: 2288523

Answers (30)

Carlost
Carlost

Reputation: 837

Well, i was facing the same issue. My real trouble was my app password, it had some spaces and the smtp worked when i removed them.

Upvotes: 0

agf
agf

Reputation: 176730

Read the Python Unicode HOWTO. This error is the very first example.

Do not use str() to convert from unicode to encoded text / bytes.

Instead, use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.

Upvotes: 1527

boyenec
boyenec

Reputation: 1617

You can you use unicodedata for avoid UnicodeEncodeError. here is an example:

import unicodedata

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = unicodedata.normalize("NFKD", agent_telno) #it will remove all unwanted character like '\xa0'
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Upvotes: -1

Dreams
Dreams

Reputation: 6122

In case its an issue with a print statement, a lot fo times its just an issue with the terminal printing. This helped me : export PYTHONIOENCODING=UTF-8

Upvotes: 10

Caleb
Caleb

Reputation: 416

It works for me:

export LC_CTYPE="en_US.UTF-8"

Upvotes: 12

Babatunde Adeyemi
Babatunde Adeyemi

Reputation: 14438

You can set the character encoding to UTF-8 before running your script:

export LC_CTYPE="en_US.UTF-8"

This should generally resolve the issue.

Upvotes: 1

Pedro Lobito
Pedro Lobito

Reputation: 98861

Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:

import sys
import io

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")

Upvotes: 7

BuvinJ
BuvinJ

Reputation: 11048

Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

Testing it:

if __name__ == '__main__': 
    print safeStr( 1 ) 
    print safeStr( "test" ) 
    print u'98\xb0'
    print safeStr( u'98\xb0' )

Results:

1
test
98°
98

UPDATE: My original answer was written for Python 2. For Python 3:

def safeStr(obj):
    try: return str(obj).encode('ascii', 'ignore').decode('ascii')
    except: return ""

Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.

Suggestion: you might want to name this function toAscii instead? That's a matter of preference...

Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:

from six import PY2, iteritems 

CHAR_SWAP = { u'\u201c': u'"'
            , u'\u201D': u'"' 
            , u'\u2018': u"'" 
            , u'\u2019': u"'" 
}

def toAscii( text ) :    
    try:
        for k,v in iteritems( CHAR_SWAP ): 
            text = text.replace(k,v)
    except: pass     
    try: return str( text ) if PY2 else bytes( text, 'replace' ).decode('ascii')
    except UnicodeEncodeError:
        return text.encode('ascii', 'replace').decode('ascii')
    except: return ""

if __name__ == '__main__':     
    print( toAscii( u'testin\u2019' ) )

Upvotes: 22

Pe Dro
Pe Dro

Reputation: 3013

In general case of writing this unsupported encoding string (let's say data_that_causes_this_error) to some file (for e.g. results.txt), this works

f = open("results.txt", "w")
  f.write(data_that_causes_this_error.encode('utf-8'))
  f.close()

Upvotes: 5

huzefausama
huzefausama

Reputation: 443

This will work:

 >>>print(unicodedata.normalize('NFD', re.sub("[\(\[].*?[\)\]]", "", "bats\xc3\xa0")).encode('ascii', 'ignore'))

Output:

>>>bats

Upvotes: 0

Gulzar
Gulzar

Reputation: 27886

The recommended solution did not work for me, and I could live with dumping all non ascii characters, so

s = s.encode('ascii',errors='ignore')

which left me with something stripped that doesn't throw errors.

Upvotes: 5

sam ruben
sam ruben

Reputation: 385

Try to avoid conversion of variable to str(variable). Sometimes, It may cause the issue.

Simple tip to avoid :

try: 
    data=str(data)
except:
    data = data #Don't convert to String

The above example will solve Encode error also.

Upvotes: 3

shmakovpn
shmakovpn

Reputation: 831

This problem often happens when a django project deploys using Apache. Because Apache sets environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment (or change to your flavior) this setting. Or use the lang option of the WSGIDaemonProcess command, in this case you will be able to set different LANG environment variable to different virtualhosts.

Upvotes: 1

hhh
hhh

Reputation: 52810

Alas this works in Python 3 at least...

Python 3

Sometimes the error is in the enviroment variables and enconding so

import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
... 
print(myText.encode('utf-8', errors='ignore'))

where errors are ignored in encoding.

Upvotes: 10

kenorb
kenorb

Reputation: 166289

In shell:

  1. Find supported UTF-8 locale by the following command:

    locale -a | grep "UTF-8"
    
  2. Export it, before running the script, e.g.:

    export LC_ALL=$(locale -a | grep UTF-8)
    

    or manually like:

    export LC_ALL=C.UTF-8
    
  3. Test it by printing special character, e.g. :

    python -c 'print(u"\u2122");'
    

Above tested in Ubuntu.

Upvotes: 35

kenorb
kenorb

Reputation: 166289

The problem is that you're trying to print a unicode character, but your terminal doesn't support it.

You can try installing language-pack-en package to fix that:

sudo apt-get install language-pack-en

which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you're trying to print).

On some Linux distributions it's required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it, than configuring it manually.

Then when writing the code, make sure you use the right encoding in your code.

For example:

open(foo, encoding='utf-8')

If you've still a problem, double check your system configuration, such as:

  • Your locale file (/etc/default/locale), which should have e.g.

    LANG="en_US.UTF-8"
    LC_ALL="en_US.UTF-8"
    

    or:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    
  • Value of LANG/LC_CTYPE in shell.

  • Check which locale your shell supports by:

    locale -a | grep "UTF-8"
    

Demonstrating the problem and solution in fresh VM.

  1. Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
    

    See: available Ubuntu boxes..

  2. Printing unicode characters (such as trade mark sign like ):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
    
  3. Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.
    
  4. Now problem should be solved:

    $ python -c 'print(u"\u2122");'
    ™
    
  5. Otherwise, try the following command:

    $ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
    ™
    

Upvotes: 41

palswim
palswim

Reputation: 12140

I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write, rather than print (so that I could support output to a different file as well).

From BeautifulSoup's own documentation, I solved this with the codecs library:

import sys
import codecs

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)

Upvotes: 0

Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP question.

However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.

I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.

Upvotes: 3

ZF007
ZF007

Reputation: 3731

Update for python 3.0 and later. Try the following in the python editor:

locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8

This sets the system`s default locale encoding to the UTF-8 format.

More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.

Upvotes: 4

Ngoc-Vuong Ho
Ngoc-Vuong Ho

Reputation: 121

Please open terminal and fire the below command:

export LC_ALL="en_US.UTF-8"

Upvotes: 8

Aravind Krishnakumar
Aravind Krishnakumar

Reputation: 2777

Below solution worked for me, Just added

u "String"

(representing the string as unicode) before my string.

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report.  Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)

Upvotes: 5

Max Korolevsky
Max Korolevsky

Reputation: 2721

I found elegant work around for me to remove symbols and continue to keep string as string in follows:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It's important to notice that using the ignore option is dangerous because it silently drops any unicode(and internationalization) support from the code that uses it, as seen here (convert unicode):

>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'

Upvotes: 254

followben
followben

Reputation: 9197

We struck this error when running manage.py migrate in Django with localized fixtures.

Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.

The issue was simply that the Django container (we use docker) was missing the LANG env var.

Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.

Upvotes: 4

Nandan Kulkarni
Nandan Kulkarni

Reputation: 64

If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:

unic = u''
packet_data = unic

Upvotes: 1

Pereira
Pereira

Reputation: 739

I always put the code below in the first two lines of the python files:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

Upvotes: 12

Kairat Koibagarov
Kairat Koibagarov

Reputation: 1475

Just add to a variable encode('utf-8')

agent_contact.encode('utf-8')

Upvotes: 8

Drag0
Drag0

Reputation: 8908

I just used the following:

import unicodedata
message = unicodedata.normalize("NFKD", message)

Check what documentation says about it:

unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.

Solves it for me. Simple and easy.

Upvotes: 6

Ashwin
Ashwin

Reputation: 2915

well i tried everything but it did not help, after googling around i figured the following and it helped. python 2.7 is in use.

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

Upvotes: 191

Andriy Ivaneyko
Andriy Ivaneyko

Reputation: 22021

Add line below at the beginning of your script ( or as second line):

# -*- coding: utf-8 -*-

That's definition of python source code encoding. More info in PEP 263.

Upvotes: 18

pepoluan
pepoluan

Reputation: 6780

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

I had this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.

Upvotes: 4

Related Questions