Reputation: 105
I am having a tough time in running the following code.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
import csv
file = open("Test.CSV", "r")
reader = csv.reader(file)
for line in reader:
text = line[5]
lst = re.findall('(http.?://[^\s]+)', text)
if not lst: print('Empty List')
else:
try:
for url in lst:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
title = soup.title.string
str_title = str (title)
if 'Twitter' in str_title:
if len(lst) > 1: break
else: continue
else:
print (str_title, ',', url)
except urllib.error.HTTPError as err:
if err.code == 404:
print ('Invalid Twitter Link')
The above mentioned code reads a csv file, selects a column, then parses that using regex to get all the hyperlinks in a single row, I then use BeautifulSoup to parse through the Hyperlink to get the 'Title String' of the page.
Now, whenever i run this code, it stops working for a particular row, and throws an error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-17: ordinal not in range(128)"
How do I work my way with the Unicode String here ? Any help would be much appreciated.
Upvotes: 0
Views: 820
Reputation: 149135
The error message shows that the problem happens in urllib.request.urlopen(url, context=ctx)
. It looks like at least one of the URLs contains non ASCII characters.
What to do?
You can try to quote the URL:
html = urllib.request.urlopen(urllib.parse.quote(url, errors='ignore'), context=ctx).read()
This will prevent the UnicodeEncodeError
, but will silently build an erroneous url which is likely to lead to problems later.
My advice is to catch the UnicodeEncodeError and display an error message that will help to understand what happens under the hood and how to actually fix it:
for url in lst:
try:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
title = soup.title.string
...
except UnicodeEncodeError as e:
print("Incorrect URL {}".format(url.encode('ascii', errors='backslashreplace')))
The errors='backslashreplace'
option will dump the code of the offending characters
Upvotes: 1