nick_rinaldi

Reputation: 707

Why isn't Beautiful Soup finding page element?

Bs4 noobie here. Tried more than a few methods to get this to work but now I'm straight up confused.

I'm trying to parse this page: https://www.basketball-reference.com/teams/NYK/2021.html

I am looking for a specific table, using the code below:

from urllib.request import urlopen
from bs4 import BeautifulSoup

year = 2021
team = "NYK"
team_url = f"https://www.basketball-reference.com/teams/{team}/{year}.html"
html = urlopen(team_url)
soup = BeautifulSoup(html, 'html.parser')
tbl = soup.find('table', {'id': 'team_misc'})
print(tbl)

My output is empty: print(tbl) shows None.

When I inspect the page, the table with id team_misc exists. I'm looking at it with my own eyes. Yet my code returns nothing. Is there an obvious reason why? I won't list everything I've tried, to save time, but if a suggestion comes up, I'll say whether I've tried it or not.

Thanks again!

Upvotes: 1

Views: 888

Answers (2)

drec4s

Reputation: 8077

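The table you are looking for sits inside an HTML comment in the raw page source, so BeautifulSoup treats it as comment text and soup.find never sees it as a tag. One possible solution is to collect the comments, parse each one as HTML, and stop at the table with the matching id.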


from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment #import the Comment object

year = 2021
team = "NYK"
team_url = f"https://www.basketball-reference.com/teams/{team}/{year}.html"
html = urlopen(team_url)
soup = BeautifulSoup(html, 'html.parser')

# Collect every HTML comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Re-parse the comment body as HTML and search it for the target id
    ele = BeautifulSoup(c.strip(), 'html.parser')
    if tbl := ele.find("table", id="team_misc"):
        print(tbl)
        break
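
To see the difference without hitting the network, here is a minimal sketch that runs on an inline HTML snippet. The snippet and the helper name find_commented_table are made up for illustration; the real page's markup is assumed to be similar but may differ.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: both tables sit inside an HTML comment.
SAMPLE_HTML = """
<div id="wrapper">
<!--
<table id="other"><tr><td>ignored</td></tr></table>
<table id="team_misc"><tr><td>example</td></tr></table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the first <table> with the given id found inside any HTML comment, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # The comment body is plain text to the outer parser, so parse it again as HTML.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
direct = soup.find("table", id="team_misc")               # comment contents are not tags
hidden = find_commented_table(SAMPLE_HTML, "team_misc")   # found after re-parsing the comment
```

Here direct comes back as None while hidden holds the table, which matches what the question describes. Searching by id inside each comment also covers the case where one comment contains several tables.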

Upvotes: 0

Jonathan Leon

Reputation: 5648

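This gets the table you've identified by loading the page in a real browser with Selenium, so the HTML you parse is the rendered page you see when you inspect it, where the table is a real element. You'll need to download chromedriver.exe into your directory or provide the correct path to it.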

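import time
from bs4 import BeautifulSoup  # parses the rendered page source
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--headless")

year = 2021
team = "NYK"
team_url = f"https://www.basketball-reference.com/teams/{team}/{year}.html"
# Selenium 4 takes the driver path via Service; older releases accepted it positionally
driver = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)

driver.get(team_url)
time.sleep(5)  # crude fixed wait for the page's scripts to finish
html = driver.page_source
driver.quit()  # close the headless browser
soup = BeautifulSoup(html, "html.parser")
tbl = soup.find('table', {'id': 'team_misc'})
print(tbl)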

Upvotes: 0
