jakecm

Reputation: 21

Neither pandas.read_html nor BeautifulSoup can find all tables on webpage

I am trying to get the 3rd and 6th tables from a webpage (https://www.pro-football-reference.com/years/2021/) but pandas.read_html and BeautifulSoup are both only finding the first two tables on the page. Here is what I've tried.

import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/'

data_pd = pd.read_html(url)
print(len(data_pd))

Output:

2

and also

import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/years/2021/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for table in soup.find_all('table'):
    print(table.get('class'))

Output:

['sortable', 'stats_table']
['sortable', 'stats_table']

I am guessing it has something to do with the way the webpage is formatted, but is there anything I can do to grab the tables that I need?

Upvotes: 1

Views: 129

Answers (1)

chitown88

Reputation: 28565

Yes, you could use Selenium to let the page render and then pull in the HTML. However, I try to avoid Selenium when I can because of the overhead.

The better option is a simple request: the static HTML does contain the other tables, but they sit inside HTML comments. You can either a) use BeautifulSoup's ability to pull out the Comment nodes and parse the tables inside them, or b) simply strip the comment tags from the response and parse the whole page.

import requests
import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/'
response = requests.get(url).text.replace("<!--","").replace("-->","")

data_pd = pd.read_html(response)
print(len(data_pd))

Output:

13
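
The comment-stripping trick is easy to see on a minimal, self-contained page. The toy HTML below is made up for illustration: one visible table and one table hidden inside an HTML comment, mimicking how pro-football-reference serves its later tables.

```python
from io import StringIO

import pandas as pd

# Toy page: one visible table, one table hidden inside an HTML comment.
html = """
<html><body>
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<!--
<table><tr><th>b</th></tr><tr><td>2</td></tr></table>
-->
</body></html>
"""

# The parser skips commented-out markup, so only the visible table is found.
print(len(pd.read_html(StringIO(html))))  # 1

# Removing the comment markers exposes the hidden table to the parser.
uncommented = html.replace("<!--", "").replace("-->", "")
print(len(pd.read_html(StringIO(uncommented))))  # 2
```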

Or, using BeautifulSoup to go through the comments:

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = 'https://www.pro-football-reference.com/years/2021/'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')

comments = data.find_all(string=lambda text: isinstance(text, Comment))

# Parse the tables visible in the static HTML, reusing the response we
# already fetched instead of downloading the page a second time
data_pd = pd.read_html(result)
for each in comments:
    if '<table' in str(each):
        data_pd.append(pd.read_html(str(each))[0])
        
print(len(data_pd))
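
The Comment-extraction pattern can likewise be checked on a self-contained toy page (the HTML string below is made up for illustration, not fetched from the site):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup, Comment

# Toy page: one visible table, one table hidden inside an HTML comment.
html = """
<html><body>
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<!-- <table><tr><th>b</th></tr><tr><td>2</td></tr></table> -->
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Start with the tables visible to the parser (here: just one).
frames = pd.read_html(StringIO(html))

# Then parse any tables hiding inside Comment nodes.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        frames.extend(pd.read_html(StringIO(str(comment))))

print(len(frames))  # 2
```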

Upvotes: 1
