Reputation: 3
I am fairly new to web scraping and want to scrape a few HTML tables using BeautifulSoup in Python. The webpage is https://fbref.com/en/comps/9/keepers/Premier-League-Stats. As you will see, it has two tables: "Squad Goalkeeping" and "Player Goalkeeping".
Using the following code I am able to capture both tables.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
url = 'https://fbref.com/en/comps/9/keepers/Premier-League-Stats'
html_content = requests.get(url).text
bs = BeautifulSoup(html_content,"lxml")
gk_stats = bs.find_all("div",attrs={"class":"table_wrapper"})
gk_stats contains 2 elements "Squad Goalkeeping" and "Player Goalkeeping", which I can see by indexing gk_stats[0] and gk_stats[1], respectively. However, when I try to find the "tr" tag in "Player Goalkeeping" it gives me an empty list.
gk_stats[1].find_all("tr")
Could anybody please explain why I cannot extract the table even though I have it as a BeautifulSoup element? I can also see the table when I inspect the element in Chrome.
I am able to extract the "Squad Goalkeeping" table using the same command but with index 0: gk_stats[0].find_all("tr")
Thanks in advance.
Upvotes: 0
Views: 320
Reputation: 9649
The problem is that the table is commented out in the HTML source. A quick fix is to remove <!-- and --> from the HTML. Also, you can load HTML tables directly into pandas with read_html (no need for BeautifulSoup):
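from io import StringIO
import requests
import pandas as pd
url = 'https://fbref.com/en/comps/9/keepers/Premier-League-Stats'
# Strip the comment markers so the commented-out tables become regular HTML
html_content = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(html_content))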
read_html will return a list of tables as dataframes, which can be accessed with df[0], df[1], etc. Player goalkeeping is in df[2]. Let's remove the top header row and the mid-table header rows:
df[2].columns = df[2].columns.droplevel(0) # drop top header row
df[2] = df[2][df[2]['Rk'].ne('Rk')].reset_index() # remove mid-table header rows
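To see what these two cleanup steps do in isolation, here is a minimal sketch on a made-up miniature frame. The top-level header names ("Info", "Performance") and the three-column layout are assumptions for illustration, not the real page structure. It also shows two optional extras: reset_index(drop=True), which avoids adding an "index" column, and pd.to_numeric, since scraped values arrive as strings.

```python
import pandas as pd

# Made-up miniature of the scraped table: two header levels, with the
# header repeated as a data row in the middle.
columns = pd.MultiIndex.from_tuples(
    [("Info", "Rk"), ("Info", "Player"), ("Performance", "GA")]
)
raw = pd.DataFrame(
    [["1", "Adrián", "9"], ["Rk", "Player", "GA"], ["2", "Alisson", "26"]],
    columns=columns,
)

clean = raw.copy()
clean.columns = clean.columns.droplevel(0)  # keep only the lower header level
clean = clean[clean["Rk"].ne("Rk")]         # drop the repeated header rows
clean = clean.reset_index(drop=True)        # drop=True: no extra "index" column
# Scraped cells are strings; convert the numeric columns explicitly.
clean[["Rk", "GA"]] = clean[["Rk", "GA"]].apply(pd.to_numeric)
print(clean)
```

On the real page, the two lines above (with a plain reset_index()) give the result that follows.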
Output of df[2]:
| | index | Rk | Player | Nation | Pos | Squad | Age | Born | MP | Starts | Min | 90s | GA | GA90 | SoTA | Saves | Save% | W | D | L | CS | CS% | PKatt | PKA | PKsv | PKm | Save% | Matches |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Adrián | es ESP | GK | Liverpool | 34-068 | 1987 | 3 | 3 | 270 | 3 | 9 | 3 | 19 | 10 | 52.6 | 1 | 1 | 1 | 1 | 33.3 | 0 | 0 | 0 | 0 | nan | Matches |
| 1 | 1 | 2 | Rúnar Alex Rúnarsson | is ISL | GK | Arsenal | 26-022 | 1995 | 1 | 0 | 16 | 0.2 | 0 | 0 | 2 | 2 | 100 | 0 | 0 | 0 | 0 | nan | 0 | 0 | 0 | 0 | nan | Matches |
| 2 | 2 | 3 | Alisson | br BRA | GK | Liverpool | 28-161 | 1992 | 23 | 23 | 2070 | 23 | 26 | 1.13 | 73 | 50 | 69.9 | 10 | 6 | 7 | 5 | 21.7 | 8 | 4 | 1 | 3 | 20 | Matches |
| 3 | 3 | 4 | Alphonse Areola | fr FRA | GK | Fulham | 28-013 | 1993 | 27 | 27 | 2430 | 27 | 30 | 1.11 | 113 | 88 | 77.9 | 5 | 11 | 11 | 9 | 33.3 | 5 | 5 | 0 | 0 | 0 | Matches |
| 4 | 4 | 5 | Kepa Arrizabalaga | es ESP | GK | Chelsea | 26-160 | 1994 | 4 | 4 | 360 | 4 | 6 | 1.5 | 18 | 12 | 66.7 | 2 | 1 | 1 | 1 | 25 | 0 | 0 | 0 | 0 | nan | Matches |
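If you would rather stay with BeautifulSoup, as in the question, an alternative to string replacement is to look for comment nodes inside the wrapper and parse their text a second time. The sketch below runs on a made-up HTML snippet that imitates the situation (one normal table, one wrapped in a comment); the snippet, the id values, and the helper name rows_including_commented are assumptions for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the page: the second table sits inside an HTML comment.
SAMPLE_HTML = """
<div class="table_wrapper" id="squads">
  <table><tr><th>Squad</th></tr><tr><td>Arsenal</td></tr></table>
</div>
<div class="table_wrapper" id="players">
  <!--
  <table><tr><th>Player</th></tr><tr><td>Alisson</td></tr></table>
  -->
</div>
"""

def rows_including_commented(wrapper):
    """Return <tr> tags found in `wrapper`, including those inside HTML comments."""
    rows = list(wrapper.find_all("tr"))
    for comment in wrapper.find_all(string=lambda s: isinstance(s, Comment)):
        # To the parser a comment is only a string, so parse its text again.
        rows.extend(BeautifulSoup(comment, "html.parser").find_all("tr"))
    return rows

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
wrappers = soup.find_all("div", attrs={"class": "table_wrapper"})

plain_counts = [len(w.find_all("tr")) for w in wrappers]
row_counts = [len(rows_including_commented(w)) for w in wrappers]
print(plain_counts)  # rows visible to a plain find_all("tr")
print(row_counts)    # rows once commented-out markup is parsed too
```

This reproduces the symptom from the question: a plain find_all("tr") on the second wrapper finds nothing, because the parser treats the commented-out table as a single string rather than as tags.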
Upvotes: 1