Reputation: 1560
I want to scrape all the tables from this website. The page contains more than 10 tables, but when I use pd.read_html() it returns only 3. I expect my script to return all of them.
My script:
import pandas as pd
url = "https://aws.pro-football-reference.com/teams/mia/2000.htm"
df = pd.read_html(url)
len(df)
Output:
3
Specifically, I want this table:
How can I get all the tables using pd.read_html()?
Upvotes: 0
Views: 353
Reputation: 917
pd.read_html extracts <table> elements from the page's HTML (it parses with lxml by default and falls back to BeautifulSoup with html5lib). Using requests to grab the HTML for the page and parsing it manually, I found that the page you linked does contain only three <table> elements. However, the data for several additional tables (including the "kicking" one you want) is stored inside HTML comments, which a parser does not treat as elements.
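To see why commented-out tables are invisible, here is a minimal sketch using only the standard library's html.parser. The SAMPLE_HTML string and the TableCounter class are made up for illustration and are not taken from the real page; the point is only that markup inside a comment arrives as comment text, not as a <table> tag.

```python
from html.parser import HTMLParser

# Hypothetical miniature page: one live table, one table inside a comment.
SAMPLE_HTML = """
<div><table id="team_stats"><tr><td>1</td></tr></table></div>
<div><!--
<table id="kicking"><tr><td>2</td></tr></table>
--></div>
"""


class TableCounter(HTMLParser):
    """Counts live <table> tags and comments whose text contains a table."""

    def __init__(self):
        super().__init__()
        self.live_tables = 0
        self.commented_tables = 0

    def handle_starttag(self, tag, attrs):
        # Only called for real elements, never for markup inside a comment.
        if tag == "table":
            self.live_tables += 1

    def handle_comment(self, data):
        # The whole comment body is delivered as one plain string.
        if "<table" in data:
            self.commented_tables += 1


counter = TableCounter()
counter.feed(SAMPLE_HTML)
counter.close()
print(counter.live_tables, counter.commented_tables)  # 1 1
```

Any table-extracting tool that walks elements would find only the first table here, which matches what happens on the real page.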
Parse the commented-out tables.
import requests
import bs4
import pandas as pd
url = "https://aws.pro-football-reference.com/teams/mia/2000.htm"
scraped_html = requests.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs4.BeautifulSoup(scraped_html.content, "html.parser")
# Get all html comments, then filter out everything that isn't a table
comments = soup.find_all(string=lambda text: isinstance(text, bs4.Comment))
commented_out_tables = [bs4.BeautifulSoup(cmt, "html.parser").find_all('table') for cmt in comments]
# Some of the entries in `commented_out_tables` are empty lists. Remove them.
commented_out_tables = [tab[0] for tab in commented_out_tables if len(tab) == 1]
print(len(commented_out_tables))
Gives 8.
Only one of these is the "kicking" table. We can find it by looking for a table with the id attribute set to kicking.
for table in commented_out_tables:
    if table.get('id') == 'kicking':
        kicking_table = table
        break
Turn this into a pd.DataFrame with pd.read_html.
from io import StringIO  # recent pandas versions deprecate passing a raw HTML string
pd.read_html(StringIO(str(kicking_table)))
Yields the following:
[ Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Games ... Kickoffs Punting
No. Player Age Pos G GS ... KOAvg Pnt Yds Lng Blck Y/P
0 1.0 Matt Turk 32.0 p 16 0.0 ... NaN 92.0 3870.0 70.0 0.0 42.1
1 10.0 Olindo Mare 27.0 k 16 0.0 ... 60.3 NaN NaN NaN NaN NaN
2 NaN Team Total 27.3 NaN 16 NaN ... 60.3 92.0 3870.0 70.0 0.0 42.1
3 NaN Opp Total NaN NaN 16 NaN ... NaN 87.0 3532.0 NaN NaN 40.6
[4 rows x 32 columns]]
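If you would rather get every table from a single pd.read_html call, one alternative is to strip the comment delimiters from the raw HTML before parsing. This is a rough sketch, not a tested recipe for this site: SAMPLE_PAGE, uncomment_html and read_all_tables are names invented here, and the regex uncomments everything on the page, not just tables.

```python
import re
from io import StringIO

# Hypothetical miniature page: one live table and one commented-out table.
SAMPLE_PAGE = (
    '<table id="team_stats"><tr><th>G</th></tr><tr><td>16</td></tr></table>'
    '<!-- <table id="kicking"><tr><th>Pnt</th></tr><tr><td>92</td></tr></table> -->'
)


def uncomment_html(html):
    """Remove comment delimiters so commented-out markup becomes live HTML."""
    return re.sub(r"<!--|-->", "", html)


def read_all_tables(html):
    """Return a list of DataFrames, including tables that were inside comments."""
    import pandas as pd  # imported here so uncomment_html works without pandas

    return pd.read_html(StringIO(uncomment_html(html)))


live_html = uncomment_html(SAMPLE_PAGE)
print("<!--" in live_html)  # False
```

For the real page you would call read_all_tables(requests.get(url).text). Passing attrs={"id": "kicking"} to pd.read_html should then narrow the result to the one table you want, assuming the id is the same once the comment is removed.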
Upvotes: 1