88dlove

Reputation: 25

Element not found on page using requests and BeautifulSoup

I am trying to scrape the website of my favorite college football team. There are two tables on the page that I would like to scrape. The code I have written scrapes the first table without trouble, and I am able to put it in a pandas DataFrame and then into Excel. For some reason I can't figure out, I am unable to scrape the second table (the defensive table). I have tried finding all tables, which finds the first table but not the second. I have also tried selecting the table by its listed attributes, which didn't work either. Any help would be greatly appreciated! Below is the code I am using to attempt to scrape the second table:

from lxml import html
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

game_summary = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'
game_summary_response = requests.get(game_summary, timeout=30)
game_summary_content = BeautifulSoup(game_summary_response.text, 'html.parser')
deffensive_table = game_summary_content.find('table', id='defense')
defensive_game_summary = deffensive_table.find_all('tr')

When I run the program I just get the following error:

Traceback (most recent call last):
  File "ncaa_stats_scrape.sh", line 24, in <module>
    defensive_game_summary = deffensive_table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'

Upvotes: 1

Views: 627

Answers (2)

Chris Hunt

Reputation: 4030

The table you're looking for is included in the HTML that is returned, but as an HTML comment. The page includes some JavaScript that executes after page load to uncomment the table so it displays. The easiest way to get the contents is to use a library that can execute JavaScript after retrieving the page, like requests_html. Example:

from requests_html import HTMLSession


url = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'
session = HTMLSession()
r = session.get(url)

r.html.render()

# find() returns a list by default; first=True returns the single matching element
table = r.html.find('table#defense', first=True)
print(table.html)
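
If you would rather stay with requests and BeautifulSoup, another option is to read the table straight out of the HTML comment instead of rendering JavaScript. Below is a minimal sketch of that idea, assuming the commented-out markup is an ordinary table. The helper name find_commented_table and the SAMPLE_HTML string are made up for illustration (the sample stands in for game_summary_response.text and its rows are not real data), so treat it as a starting point rather than a tested scraper for that site.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the downloaded page: the second table sits inside
# an HTML comment, so a normal find() on the parsed page cannot see it.
SAMPLE_HTML = """
<html><body>
<table id="offense"><tr><th>Opponent</th></tr><tr><td>Team A</td></tr></table>
<div>
<!--
<table id="defense">
  <tr><th>Opponent</th><th>Points</th></tr>
  <tr><td>Team A</td><td>7</td></tr>
</table>
-->
</div>
</body></html>
"""


def find_commented_table(page_html, table_id):
    """Return the <table> with the given id, looking inside HTML comments if needed.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    soup = BeautifulSoup(page_html, "html.parser")

    # First try the normal route, in case the table is not commented out.
    table = soup.find("table", id=table_id)
    if table is not None:
        return table

    # Comment nodes are plain strings to the parser, so re-parse each
    # comment that looks like it contains a table.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if "<table" not in comment:
            continue
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is not None:
            return table
    return None


defense = find_commented_table(SAMPLE_HTML, "defense")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in defense.find_all("tr")
]
print(rows)
```

With the real page you would pass game_summary_response.text in place of SAMPLE_HTML and check the result for None before calling find_all on it.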

Upvotes: 1

Shaurya Kumar

Reputation: 39

The error you posted means that the value of deffensive_table is None, because find did not locate a matching table in the parsed HTML.

That's why you get an AttributeError when you call find_all on it. A possible fix is to do a None check before calling find_all:

deffensive_table = game_summary_content.find('table', id='defense')
if deffensive_table is not None:
    defensive_game_summary = deffensive_table.find_all('tr')
else:
    # The table was not found in the parsed HTML; handle that case here.
    defensive_game_summary = []

Upvotes: 1
