aviss

Reputation: 2449

How to fix UnicodeDecodeError: 'ascii' codec can't decode byte?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

This is the error I get when trying to clean a list of names I extract using spaCy from an html page.

My code:

import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English
from __future__ import unicode_literals
nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})]

    text = ''.join(article_soup)
    return text

# using spacy
def get_names(all_tags):
    names=[]
    for ent in all_tags.ents:
        if ent.label_=="PERSON":
            names.append(str(ent))
    return names

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names] # remove 's' from names
    myset = list(set(new_names)) #remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text=get_text(url)
    text=u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)
    print "names:"
    print names
    mynewlist = cleaning_names(names)
    print mynewlist

if __name__ == '__main__':
    main()

For this particular URL I get the list of names which includes characters like £ or $:

['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']

And then the error:

Traceback (most recent call last) <ipython-input-19-8582e806c94a> in <module>()
     47 
     48 if __name__ == '__main__':
---> 49     main()

<ipython-input-19-8582e806c94a> in main()
     43     print "names:"
     44     print names
---> 45     mynewlist = cleaning_names(names)
     46     print mynewlist
     47 

<ipython-input-19-8582e806c94a> in cleaning_names(names)
     31 
     32 def cleaning_names(names):
---> 33     new_names = [s.strip("'s") for s in names] # remove 's' from names
     34     myset = list(set(new_names)) #remove duplicates
     35     return myset

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

I tried different ways of fixing unicode (including sys.setdefaultencoding('utf8')), nothing worked. I hope someone had the same issue before and will be able to suggest a fix. Thank you!
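For reference, the `'\xc2\xa3'` in the list above is the UTF-8 encoding of `£`, and the error comes from mixing those byte strings with Unicode text. A minimal reproduction of the mismatch, sketched under Python 3 (where mixing bytes and text raises an error outright, instead of the implicit ASCII decode Python 2 attempts):

```python
# str(ent) under Python 2 yields UTF-8 byte strings, e.g. '\xc2\xa3' for '£'
name = b'\xc2\xa359bn'

# strip("'s") mixes a byte string with a text string; Python 2 tries to
# ASCII-decode the bytes (raising UnicodeDecodeError on 0xc2), while
# Python 3 refuses the mix outright:
try:
    name.strip("'s")
except TypeError as exc:
    print("TypeError:", exc)
```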

Upvotes: 0

Views: 7512

Answers (3)

alvas

Reputation: 122148

As @MarkRansom commented, ignoring non-ASCII characters is going to bite you back.

First take a look at

Also, note that this is an anti-pattern: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

The easiest solution is to just use Python 3, which will spare you some of the pain:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> import spacy
>>> nlp = spacy.load('en')

>>> url = "http://www.bbc.co.uk/news/uk-politics-39784164"
>>> html = requests.get(url).content
>>> bsoup = BeautifulSoup(html, 'html.parser')
>>> text = '\n'.join(p.text for d in bsoup.find_all('div', {'class': 'story-body__inner'}) for p in d.find_all('p') if p.text.strip())

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(text)
>>> names = [ent for ent in doc.ents if ent.label_ == 'PERSON']

Upvotes: 1

aviss

Reputation: 2449

I finally fixed my code. I am surprised how easy the fix looks given how long it took me to get there, and since so many people seem puzzled by the same problem I decided to post my answer.

Adding this small function before passing names for further cleaning solved my problem.

def decode(names):        
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames
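Note that `errors='ignore'` silently drops the bytes it can't decode, which is why the `£` disappears from the output below. A Python 3 sketch of the same behaviour (`unicode(name, errors='ignore')` defaults to the ASCII codec, equivalent to `bytes.decode('ascii', errors='ignore')`):

```python
name = b'\xc2\xa359bn'  # UTF-8 bytes for '£59bn'

# Ignoring undecodable bytes loses the pound sign entirely:
print(name.decode('ascii', errors='ignore'))  # 59bn

# Decoding with the right codec keeps it:
print(name.decode('utf-8'))  # £59bn
```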

spaCy still thinks that £59bn is a PERSON, but that's fine with me; I can deal with it later in my code.

The working code:

import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English
from __future__ import unicode_literals
nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})]

    text = ''.join(article_soup)
    return text

# using spacy
def get_names(all_tags):
    names=[]
    for ent in all_tags.ents:
        if ent.label_=="PERSON":
            names.append(str(ent))
    return names

def decode(names):        
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names] # remove 's' from names
    myset = list(set(new_names)) #remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text=get_text(url)
    text=u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)
    print "names:"
    print names
    decodednames = decode(names)
    mynewlist = cleaning_names(decodednames)
    print mynewlist

if __name__ == '__main__':
    main()

which gives me this with no errors:

names: ['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May'] [u'Mr Clegg', u'Brexit', u'Nick Clegg', u'59bn', u'Theresa May']

Upvotes: 0

Mark Ransom

Reputation: 308402

When you get a decoding error with the 'ascii' codec, that's usually an indication that a byte string is being used in a context where a Unicode string is required (in Python 2; Python 3 won't allow the implicit mixing at all).

Since you've imported from __future__ import unicode_literals, the string "'s" is Unicode. This means the string you're trying to strip must be a Unicode string too. Fix that and you won't get the error anymore.
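A minimal sketch of that fix, assuming the names arrive as UTF-8 byte strings (which is what `str(ent)` produces in Python 2): decode each name to Unicode before stripping, so both operands of `strip` are text:

```python
from __future__ import unicode_literals

# Byte strings, as str(ent) yields them in Python 2:
names = [b'\xc2\xa359bn', b'Mr Clegg']

# Decode to text first, so strip("'s") operates on text with text:
new_names = [s.decode('utf-8').strip("'s") for s in names]
print(new_names)  # ['£59bn', 'Mr Clegg']
```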

Upvotes: 1
