Chris Macaluso

Reputation: 1482

BeautifulSoup Exception mid loop scraping HTML file

I'm trying to scrape a local folder of HTML files for a couple of variables, but I'm getting an exception about halfway through the loop. The exception is AttributeError: 'NoneType' object has no attribute 'contents'. The problem is not actually .contents: if you remove .contents, the same exception is raised for find() instead. I've looked at the file it gets hung up on, and it's structured exactly the same as the other files. Does anyone know why this is happening? Again, many of the files process without a problem. My code is below:

import os

import pandas as pd
from bs4 import BeautifulSoup

df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file)
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div', class_ = 'audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
    
    
        # print(num_audience_ratings)
        # break
           
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])

Upvotes: 2

Views: 163

Answers (1)

Bitto

Reputation: 8225

My guess is that some of the files do not have the attributes you are looking for.

For example:

 audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]

If there is no div with the class audience-score meter, then soup.find('div', class_ = 'audience-score meter') will return None. Any subsequent find or contents access on that result will raise an AttributeError.
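A minimal sketch of that behaviour, assuming a made-up page that lacks the div (the HTML string and the "html.parser" choice are illustrative assumptions, not taken from the question's files):

```python
from bs4 import BeautifulSoup

# Hypothetical page that has a <title> but no audience-score div.
html = "<html><head><title>Example - Rotten Tomatoes</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() does not raise when nothing matches; it returns None.
missing = soup.find("div", class_="audience-score meter")
print(missing)

# The error only appears when the None result is used like a tag.
error_message = ""
try:
    missing.find("span")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

So the traceback names whichever attribute happens to be accessed first on the None result, which is why removing .contents just moves the error to find().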

One solution is to wrap the lookup in try/except and fall back to an empty string.

try:
    audience_score = soup.find('div', class_ = 'audience-score meter').find('span').contents[0][:-1]
except AttributeError:
    audience_score = ""

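Do the same for title and for num_audience_ratings (both assignments). Note that int("") raises a ValueError, so the int() calls in df_list.append need a guard as well, or the file should be skipped when a value is missing.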
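An alternative is to parse each file in one place and skip it when anything is missing. The sketch below is only an illustration: parse_movie and TITLE_SUFFIX are hypothetical names, "html.parser" is an assumed parser choice, and the selectors are copied from the question.

```python
from bs4 import BeautifulSoup

TITLE_SUFFIX = " - Rotten Tomatoes"  # hypothetical constant for the title suffix


def parse_movie(html):
    """Return a dict with the three fields, or None if the page lacks any of them."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        title = soup.find("title").contents[0][:-len(TITLE_SUFFIX)]
        score = soup.find("div", class_="audience-score meter").find("span").contents[0][:-1]
        info = soup.find("div", class_="audience-info hidden-xs superPageFontColor")
        ratings = info.find_all("div")[1].contents[2].strip().replace(",", "")
        return {
            "title": title,
            "audience_score": int(score),
            "number_of_audience_ratings": int(ratings),
        }
    except (AttributeError, IndexError, ValueError):
        # AttributeError: a find() call returned None.
        # IndexError: fewer divs or children than expected.
        # ValueError: the extracted text is not a valid integer.
        return None
```

Inside the loop you could then call parse_movie(file.read()), print movie_html and continue when the result is None, and append the dict otherwise. Printing the skipped file names also shows exactly which files are missing the elements.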

Upvotes: 3
