vishrut88

Reputation: 3

Cannot scrape specific table using BeautifulSoup

I am fairly new to web scraping and want to scrape a few HTML tables using BeautifulSoup in Python. The webpage is https://fbref.com/en/comps/9/keepers/Premier-League-Stats. As you will see, there are two tables: "Squad Goalkeeping" and "Player Goalkeeping".

Using the following code I am able to capture both tables.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re


url = 'https://fbref.com/en/comps/9/keepers/Premier-League-Stats'
html_content = requests.get(url).text

bs = BeautifulSoup(html_content,"lxml")
gk_stats = bs.find_all("div",attrs={"class":"table_wrapper"})

gk_stats contains two elements, "Squad Goalkeeping" and "Player Goalkeeping", which I can see by indexing gk_stats[0] and gk_stats[1], respectively. However, when I try to find the "tr" tags in "Player Goalkeeping", I get an empty list.

gk_stats[1].find_all("tr")

Could anybody please explain why I cannot extract the table even though I have it as a BeautifulSoup element? I can also see the table when I inspect the element in the Chrome browser.

I am able to extract the "Squad Goalkeeping" table using the same command but with index 0: gk_stats[0].find_all("tr")

Thanks in advance.

Upvotes: 0

Views: 320

Answers (1)

RJ Adriaansen

Reputation: 9649

The problem is that the table is commented out in the HTML source, so BeautifulSoup sees it as a comment rather than as table elements. A quick fix is to remove <!-- and --> from the HTML. Also, you can load HTML tables directly into pandas with read_html (no need for BeautifulSoup):

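from io import StringIO

import pandas as pd
import requests

url = 'https://fbref.com/en/comps/9/keepers/Premier-League-Stats'
# Strip the comment markers so the commented-out tables become regular HTML
html_content = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(html_content))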
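
If you would rather stay with BeautifulSoup, here is a minimal sketch of the same idea on a made-up HTML snippet (the markup, the id values and the keeper names are invented for illustration, not taken from the real page). It shows why find_all("tr") comes back empty on the wrapper, and one way to get at the rows by parsing the comment text itself:

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup mimicking a wrapper whose table sits inside an HTML comment.
html = """
<div class="table_wrapper" id="all_demo">
  <!--
  <table id="demo">
    <tr><th>Player</th><th>Saves</th></tr>
    <tr><td>Keeper A</td><td>10</td></tr>
    <tr><td>Keeper B</td><td>12</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find("div", attrs={"class": "table_wrapper"})

# The parser keeps the commented-out table as one Comment string, not as tags,
# so searching the wrapper for <tr> finds nothing.
rows_before = wrapper.find_all("tr")

# Collect the comment nodes inside the wrapper and parse their text as HTML.
comments = wrapper.find_all(string=lambda text: isinstance(text, Comment))
rows_after = []
for comment in comments:
    inner = BeautifulSoup(comment, "html.parser")
    rows_after.extend(inner.find_all("tr"))

# Flatten each row into a list of cell strings.
cells = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in rows_after
]
print(len(rows_before), len(rows_after))
print(cells)
```

This only touches comments inside the wrapper you selected, whereas the replace approach uncomments everything on the page. Which one is preferable probably depends on how many tables you need.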

read_html will return a list of tables as dataframes, which can be accessed with df[0], df[1] etc. Player goalkeeping is in df[2]. Let's remove the top header row and the mid-table header rows:

df[2].columns = df[2].columns.droplevel(0) # drop top header row
df[2] = df[2][df[2]['Rk'].ne('Rk')].reset_index() # remove mid-table header rows 

Output df[2]:

   index  Rk  Player                Nation  Pos  Squad      Age     Born  MP  Starts  Min   90s  GA  GA90  SoTA  Saves  Save%  W   D   L   CS  CS%   PKatt  PKA  PKsv  PKm  Save%  Matches
0  0      1   Adrián                es ESP  GK   Liverpool  34-068  1987  3   3       270   3    9   3     19    10     52.6   1   1   1   1   33.3  0      0    0     0    nan    Matches
1  1      2   Rúnar Alex Rúnarsson  is ISL  GK   Arsenal    26-022  1995  1   0       16    0.2  0   0     2     2      100    0   0   0   0   nan   0      0    0     0    nan    Matches
2  2      3   Alisson               br BRA  GK   Liverpool  28-161  1992  23  23      2070  23   26  1.13  73    50     69.9   10  6   7   5   21.7  8      4    1     3    20     Matches
3  3      4   Alphonse Areola       fr FRA  GK   Fulham     28-013  1993  27  27      2430  27   30  1.11  113   88     77.9   5   11  11  9   33.3  5      5    0     0    0      Matches
4  4      5   Kepa Arrizabalaga     es ESP  GK   Chelsea    26-160  1994  4   4       360   4    6   1.5   18    12     66.7   2   1   1   1   25    0      0    0     0    nan    Matches
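
To make the two cleanup steps easier to follow without hitting the website, here is a small sketch on a hand-built DataFrame. The two-level header and the rows are invented and only assumed to resemble the shape read_html produces for this kind of table. It also passes drop=True to reset_index, which is my own variation: it avoids the extra "index" column you see in the output above.

```python
import pandas as pd

# Hypothetical two-level header, assumed to be similar in shape to the real table.
columns = pd.MultiIndex.from_tuples(
    [("Info", "Rk"), ("Info", "Player"), ("Performance", "Saves")]
)
raw = pd.DataFrame(
    [
        ["1", "Keeper A", "10"],
        ["Rk", "Player", "Saves"],  # header row repeated mid-table
        ["2", "Keeper B", "12"],
    ],
    columns=columns,
)

table = raw.copy()
# Step 1: drop the top header level so columns are plain names like 'Rk'.
table.columns = table.columns.droplevel(0)
# Step 2: keep only rows where 'Rk' is not the literal header text,
# then renumber the rows without keeping the old index as a column.
table = table[table["Rk"].ne("Rk")].reset_index(drop=True)
# The repeated header rows force string columns, so convert numbers explicitly.
table["Saves"] = pd.to_numeric(table["Saves"])
print(table)
```

On the real table you would likely want to_numeric on most of the stat columns, since the repeated header rows tend to leave them as strings.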

Upvotes: 1
