Ayush Saxena

Reputation: 105

UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-17: ordinal not in range(128)

I am having a tough time running the following code.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
import csv

# SSL context for the urlopen calls below (certificate checks disabled)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

file = open("Test.CSV", "r")
reader = csv.reader(file)
for line in reader:
    text = line[5]                  # column holding the free text with the links
    lst = re.findall(r'(http.?://[^\s]+)', text)

    if not lst: print('Empty List')
    else:
        try:
            for url in lst:
                html = urllib.request.urlopen(url, context=ctx).read()
                soup = BeautifulSoup(html, 'html.parser')
                title = soup.title.string
                str_title = str(title)
                if 'Twitter' in str_title:
                    if len(lst) > 1: break
                    else: continue
                else:
                    print(str_title, ',', url)
        except urllib.error.HTTPError as err:
            if err.code == 404:
                print('Invalid Twitter Link')

The code above reads a CSV file, selects one column, and uses a regex to extract all the hyperlinks in each row; I then fetch each hyperlink and use BeautifulSoup to get the title string of the page.
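For illustration, on a made-up cell value (not one from my actual CSV) the regex picks out every link in the text:

>>> import re
>>> text = 'see https://twitter.com/someuser and http://example.com/page'
>>> re.findall(r'(http.?://[^\s]+)', text)
['https://twitter.com/someuser', 'http://example.com/page']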

Now, whenever I run this code, it stops at a particular row and throws the error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-17: ordinal not in range(128)".

How do I handle the Unicode string here? Any help would be much appreciated.

Upvotes: 0

Views: 820

Answers (1)

Serge Ballesta

Reputation: 149135

The error message shows that the problem happens in urllib.request.urlopen(url, context=ctx). It looks like at least one of the URLs contains non-ASCII characters.
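For example (with a made-up URL, not one from your file), a single non-ASCII character in the path is enough to trigger it, because http.client encodes the request line as plain ASCII:

import urllib.request

url = 'https://example.com/café'   # 'é' is not ASCII
try:
    urllib.request.urlopen(url)
except UnicodeEncodeError as e:
    print(e)   # 'ascii' codec can't encode character '\xe9' ...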

What to do?

You can try to percent-quote the URL, keeping the characters that delimit its structure in the safe set:

html = urllib.request.urlopen(urllib.parse.quote(url, safe=':/?&=#%', errors='ignore'), context=ctx).read()

This prevents the UnicodeEncodeError, but it is guesswork: the quoted URL it silently builds may not be the one the server expects, which is likely to lead to problems later.
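If the non-ASCII characters only appear in the path or query, a slightly more careful variant (a sketch under that assumption, not part of the one-liner above) quotes only those components, so the scheme and host are left untouched:

from urllib.parse import urlsplit, urlunsplit, quote

def ascii_safe_url(url):
    # Percent-quote only the path, query and fragment; the scheme and
    # netloc are passed through unchanged (assumed to be plain ASCII).
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe='/%'),
        quote(parts.query, safe='=&%'),
        quote(parts.fragment, safe='%'),
    ))

html = urllib.request.urlopen(ascii_safe_url(url), context=ctx).read()

This still assumes the host name itself is ASCII; an internationalized domain name would additionally need IDNA encoding.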

My advice is to catch the UnicodeEncodeError and display an error message that helps you understand what is happening under the hood and how to actually fix it:

for url in lst:
    try:
        html = urllib.request.urlopen(url, context=ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.title.string
        ...
    except UnicodeEncodeError as e:
        print("Incorrect URL {}".format(url.encode('ascii', errors='backslashreplace')))

The errors='backslashreplace' option will dump the codes of the offending characters.
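For example, with a hypothetical bad URL, the escape makes the offending character visible instead of crashing:

>>> 'https://example.com/café'.encode('ascii', errors='backslashreplace')
b'https://example.com/caf\\xe9'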

Upvotes: 1
