Ayush Saxena

Reputation: 105

AttributeError: 'NoneType' object has no attribute 'string'

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
import csv

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# text = input ('Enter Text - ') - In case the user wants to manually put in some text to evaluate
#print ('\n')
#print (len(lst))

# Take 'Content' input from a csv file
file = open("Test_1.CSV", "r", encoding='utf-8')
reader = csv.reader(file)
for line in reader:
    text = line[5]
    lst = re.findall(r'(http.?://[^\s]+)', text)

    if not lst: print(line[0], 'Empty List')
    else:
        try:
            for url in lst:
                try:
                    try:
                        html = urllib.request.urlopen(url, context=ctx).read()
                        #html = urllib.request.urlopen(urllib.parse.quote(url, errors='ignore'), context=ctx).read()
                        soup = BeautifulSoup(html, 'html.parser')
                        title = soup.title.string
                        str_title = str (title)
                        if 'Twitter' in str_title:
                            if len(lst) > 1: break
                            else: continue
                        else:
                            print (line[0], str_title, ',', url)
                    except UnicodeEncodeError as e:
                        #print("Incorrect URL {}".format(url.encode('ascii', errors='ignore')))
                        b_url = url.encode('ascii', errors='ignore')
                        n_url = b_url.decode("utf-8")
                        #print (n_url)
                        html = urllib.request.urlopen(n_url, context=ctx).read()
                        #html = urllib.request.urlopen(urllib.parse.quote(url, errors='ignore'), context=ctx).read()
                        soup = BeautifulSoup(html, 'html.parser')
                        title = soup.title.string
                        str_title = str (title)
                        if 'Twitter' in str_title:
                            if len(lst) > 1: break
                            else: continue
                        else:
                            print (line[0], str_title, ',', url)
                except urllib.error.URLError:
                    print ('Invalid DNS Link')
        except urllib.error.HTTPError as err:
            if err.code == 404:
                print (line[0], 'Invalid Twitter Link')

The code above reads a CSV file, selects one column, and uses a regex to extract all the hyperlinks in each row. I then fetch each hyperlink and use BeautifulSoup to parse the page and get its title string.

While running this code, I first encountered UnicodeEncodeError and addressed it; I then encountered urllib.error.URLError and addressed that too. Now I've run into another one:

Traceback (most recent call last):
  File "C:\Users\asaxena\Desktop\py4e\Gartner\crawler_new.py", line 32, in <module>
    title = soup.title.string
AttributeError: 'NoneType' object has no attribute 'string'

Is there any way for me to bypass every type of error that appears, even the unforeseen ones? I know BeautifulSoup tends to throw unexpected errors, partly because of the varied kinds of content found on the web.

Upvotes: 0

Views: 1589

Answers (1)

Ayush Saxena

Reputation: 105

I finally solved it by placing the entire code inside a try/except block, like this:

try:
    ...  # all of my code goes here
except Exception as e:
    print('Error Ignored')

This catches every exception derived from Exception, so the script no longer stops on unexpected errors such as the AttributeError in the question.
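A narrower variant may be worth considering, since one try/except around the whole script stops at the first error and skips every remaining row. The sketch below is only an illustration under a few assumptions: `title_text` and `collect_titles` are hypothetical helper names, and `fetch_soup` is a hypothetical callable that takes a URL and returns the parsed page (for example a wrapper around `urllib.request.urlopen` and `BeautifulSoup`). It checks for a missing title before reading `.string`, and it catches exceptions per URL so one bad link does not stop the others.

```python
import logging


def title_text(soup, default="(no title)"):
    """Return the page title, or `default` when the page has no usable <title>."""
    # BeautifulSoup gives None for soup.title when the page has no <title> tag,
    # and None for .string when the tag is empty or has several children.
    title_tag = getattr(soup, "title", None)
    if title_tag is None or title_tag.string is None:
        return default
    return str(title_tag.string).strip()


def collect_titles(urls, fetch_soup):
    """Map each URL to its title; a failure on one URL does not stop the rest.

    `fetch_soup` is a hypothetical callable (url -> parsed page), not part of
    the original code.
    """
    titles, failures = {}, {}
    for url in urls:
        try:
            titles[url] = title_text(fetch_soup(url))
        except Exception as exc:  # deliberately broad, but scoped to one URL
            logging.warning("Skipping %s: %r", url, exc)
            failures[url] = exc
    return titles, failures
```

In the script from the question, `collect_titles(lst, fetch_soup)` could be called once per CSV row. The returned `failures` dictionary keeps the ignored exceptions, so they can be reviewed later instead of being lost behind a single 'Error Ignored' message.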

Upvotes: 1
