Reputation: 105
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
import csv
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
# text = input('Enter Text - ')  # in case the user wants to manually enter some text to evaluate
#print ('\n')
#print (len(lst))
# Take 'Content' input from a csv file
file = open("Test_1.CSV", "r", encoding='utf-8')
reader = csv.reader(file)
for line in reader:
    text = line[5]
    lst = re.findall(r'(http.?://[^\s]+)', text)
    if not lst: print(line[0], 'Empty List')
    else:
        try:
            for url in lst:
                try:
                    try:
                        html = urllib.request.urlopen(url, context=ctx).read()
                        #html = urllib.request.urlopen(urllib.parse.quote(url, errors='ignore'), context=ctx).read()
                        soup = BeautifulSoup(html, 'html.parser')
                        title = soup.title.string
                        str_title = str (title)
                        if 'Twitter' in str_title:
                            if len(lst) > 1: break
                            else: continue
                        else:
                            print (line[0], str_title, ',', url)
                    except UnicodeEncodeError as e:
                        #print("Incorrect URL {}".format(url.encode('ascii', errors='ignore')))
                        b_url = url.encode('ascii', errors='ignore')
                        n_url = b_url.decode("utf-8")
                        #print (n_url)
                        html = urllib.request.urlopen(n_url, context=ctx).read()
                        #html = urllib.request.urlopen(urllib.parse.quote(url, errors='ignore'), context=ctx).read()
                        soup = BeautifulSoup(html, 'html.parser')
                        title = soup.title.string
                        str_title = str (title)
                        if 'Twitter' in str_title:
                            if len(lst) > 1: break
                            else: continue
                        else:
                            print (line[0], str_title, ',', url)
                except urllib.error.URLError:
                    print ('Invalid DNS Link')
        except urllib.error.HTTPError as err:
            if err.code == 404:
                print (line[0], 'Invalid Twitter Link')
The code above reads a CSV file, takes one column from each row, and uses a regex to extract every hyperlink in that cell. I then fetch each link and use BeautifulSoup to get the title string of the page.
While running this code, I first encountered UnicodeEncodeError and addressed it; I then encountered urllib.error.URLError and addressed that too. Now I've run into another one:

Traceback (most recent call last):
  File "C:\Users\asaxena\Desktop\py4e\Gartner\crawler_new.py", line 32, in <module>
    title = soup.title.string
AttributeError: 'NoneType' object has no attribute 'string'
Is there any way for me to skip past any type of error that appears, even unforeseen ones? I know BeautifulSoup tends to raise unexpected errors, partly because content on the web varies so much.
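A side note on the two urllib handlers used here: `urllib.error.HTTPError` is a subclass of `urllib.error.URLError`, so an `except urllib.error.URLError` clause that is reached first will also catch HTTP errors such as a 404. The sketch below illustrates the ordering only; `describe`, `http_404`, and `dns_fail` are hypothetical names made up for this example, and the errors are constructed by hand so nothing touches the network.

```python
import urllib.error
from email.message import Message


def describe(exc):
    """Classify a urllib error, checking the more specific class first."""
    try:
        raise exc
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be listed first.
        return f"HTTP {err.code}"
    except urllib.error.URLError as err:
        # Anything else urllib reports: DNS failure, refused connection, ...
        return f"URL error: {err.reason}"


# Hand-built errors (illustrative values, not real responses).
http_404 = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), None
)
dns_fail = urllib.error.URLError("name resolution failed")

print(describe(http_404))
print(describe(dns_fail))
```

If the two `except` clauses were swapped, the 404 would be reported through the `URLError` branch and the status code check would never run.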
Upvotes: 0
Views: 1589
Reputation: 105
I finally solved it by placing the entire code inside a try / except block:

try:
    # Put all my code here
except Exception as e:
    print('Error Ignored')

This catches any exception derived from Exception, so the script no longer stops with a traceback.
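A narrower variant may be worth considering: if the try / except wraps each URL rather than the whole script, one failing link is skipped and the remaining links are still processed. The following is a minimal sketch under that assumption, not the original script. `title_or_none`, `collect_titles`, and `fetch_soup` are hypothetical names, and `fake_fetch` is a stand-in that lets the sketch run without network access or BeautifulSoup installed. The `None` check covers the case behind the AttributeError, since `soup.title` is `None` when a page has no `<title>` tag.

```python
from types import SimpleNamespace


def title_or_none(soup):
    """Return the page title text, or None when there is no usable <title>."""
    tag = getattr(soup, "title", None)  # BeautifulSoup gives None for a missing tag
    if tag is None or tag.string is None:
        return None
    return str(tag.string).strip()


def collect_titles(urls, fetch_soup):
    """Try every URL; record a failure instead of letting it end the run."""
    results = []
    for url in urls:
        try:
            results.append((url, title_or_none(fetch_soup(url))))
        except Exception as exc:  # deliberately broad: skip this URL, keep going
            results.append((url, f"skipped: {type(exc).__name__}"))
    return results


def fake_fetch(url):
    """Stand-in for a real fetch-and-parse step (assumption for this demo)."""
    if "bad" in url:
        raise ValueError("simulated failure")
    if "untitled" in url:
        return SimpleNamespace(title=None)  # page without a <title> tag
    return SimpleNamespace(title=SimpleNamespace(string="Example"))


for row in collect_titles(["http://ok", "http://untitled", "http://bad"], fake_fetch):
    print(row)
```

In the real script, `fetch_soup` could be something like `lambda url: BeautifulSoup(urllib.request.urlopen(url, context=ctx).read(), 'html.parser')`. Recording the exception's class name instead of printing a fixed message also makes it easier to see later which links failed and why.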
Upvotes: 1