Reputation: 25
I am trying to scrape the website of my favorite college football team. There are two tables on the webpage that I would like to scrape, and the code I have written easily scrapes the first one. I am able to put it in a pandas dataframe and then into Excel. For some reason that I can't figure out, I am unable to scrape the second table (the defensive table) from the site. I have tried a number of different methods. I have tried just finding all tables, which finds the first table just fine but fails to find the second. I have tried using the listed attributes on the table, which didn't work either. Any help would be greatly appreciated! Below is the code I am using to attempt to scrape the second table:
from lxml import html
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
game_summary = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'
game_summary_response = requests.get(game_summary, timeout=30)
game_summary_content = BeautifulSoup(game_summary_response.text, 'html.parser')
deffensive_table = game_summary_content.find('table', id='defense')
defensive_game_summary = deffensive_table.find_all('tr')
When I run the program I just get the following error:
Traceback (most recent call last):
  File "ncaa_stats_scrape.sh", line 24, in <module>
    defensive_game_summary = deffensive_table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'
Upvotes: 1
Views: 627
Reputation: 4030
The table you're looking for is included in the HTML that is returned, but as an HTML comment. The page includes some JavaScript that executes after page load to uncomment the table so it displays. The easiest way to get the contents is to use a library that can execute JavaScript after retrieving the page, like requests_html. Example:
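from requests_html import HTMLSession
url = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'
session = HTMLSession()
r = session.get(url)
# Execute the page's JavaScript so the commented-out table is added to the DOM
r.html.render()
# find() returns a list by default; first=True returns a single element
table = r.html.find('table#defense', first=True)
print(table.html)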
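Since the table is in the returned HTML as a comment, another option is to pull the markup out of the comment directly instead of running the JavaScript. Below is a minimal sketch using only the standard library. The names CommentCollector and find_commented_markup are hypothetical helpers, and sample_page is a made-up stand-in for the downloaded page, not a copy of the real site's markup.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the raw text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called once per <!-- ... --> block with the text between the markers
        self.comments.append(data)


def find_commented_markup(html_text, marker):
    """Return the first comment body containing `marker`, or None."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    for comment in collector.comments:
        if marker in comment:
            return comment.strip()
    return None


# Hypothetical stand-in for game_summary_response.text (assumed structure)
sample_page = """
<div id="all_offense"><table id="offense"><tr><td>1</td></tr></table></div>
<div id="all_defense">
<!--
<table id="defense"><tr><td>2</td></tr></table>
-->
</div>
"""

defense_markup = find_commented_markup(sample_page, 'id="defense"')
print(defense_markup)
```

The string that comes back is ordinary HTML, so it can be parsed the same way as in the question, for example with BeautifulSoup(defense_markup, 'html.parser').find('table', id='defense'). If the marker is not found in any comment, the helper returns None, so that case still needs to be checked.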
Upvotes: 1
Reputation: 39
The error that you have posted basically means that the value of deffensive_table is None. That's why when you do a find_all on that, you get an AttributeError. A possible fix could be to do a None check before calling find_all:
deffensive_table = game_summary_content.find('table', id='defense')
if deffensive_table is not None:
    defensive_game_summary = deffensive_table.find_all('tr')
else:
    # some other logic to handle this case (placeholder: no rows found)
    defensive_game_summary = []
Upvotes: 1