joshblech

Reputation: 25

CSS selector in Beautiful Soup not finding a tag

There are plenty of similar questions, but none of them have answered mine. I am trying to use a CSS selector to find a tag in Beautiful Soup.

I only need one specific section of the HTML (the table with id "four_factors"); the full HTML is quite large.

The URL I am scraping from is in my code.

Here is some test code that hopefully shows my problem:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.basketball-reference.com/boxscores/201510310MEM.html"
response = urlopen(url)
html = response.read().decode()

# proves the element I am selecting exists in the html
print(html.find("table class=\"suppress_all stats_table\" id=\"four_factors\" data-cols-to-freeze=\",1\"")) 

soup = BeautifulSoup(html, 'html.parser')

# this line prints a similar piece of data to the one I want, but not the correct one
print(soup.select('tbody > tr > td[data-stat="off_rtg"]')[0].get_text())

# when I try being more specific, it prints an empty list
print(soup.select('table[id="four_factors"] tbody > tr > td[data-stat="off_rtg"]'))

Output:

78720
98
[]

As my code illustrates, an element that can be found using Python's str.find() method is for some reason invisible to BeautifulSoup. I've tried using BeautifulSoup's .find() and .findAll() instead of a CSS selector, with the same results. I've also tried the lxml parser, with the same results.

Upvotes: 1

Views: 257

Answers (1)

MendelG

Reputation: 20008

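This is happening because the table is inside an HTML comment (<!--...-->). The parser stores a comment as a single Comment text node rather than as tags, so neither CSS selectors nor .find() can see the table, while a plain string search on the raw HTML still matches it.

You can extract the table by collecting the nodes of type Comment and parsing their contents as HTML: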

from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/boxscores/201510310MEM.html"
response = urlopen(url)
html = response.read().decode()

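soup = BeautifulSoup(html, "html.parser")
# collect every comment node (string= is the current name for the older text= argument)
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
# join the comment bodies and parse them as a second document
comment_soup = BeautifulSoup("".join(comments), "html.parser")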

print(
    comment_soup.select_one(
        'table[id="four_factors"] tbody > tr > td[data-stat="off_rtg"]'
    ).text
)

Output:

102.5
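To see the effect without hitting the network, here is a minimal sketch on a made-up snippet. The HTML string and the select_in_comments helper are assumptions for illustration only (they are not part of the real page or of Beautiful Soup), and the cell value is just sample data. It shows that the selector matches nothing on the original soup, and that parsing each comment separately finds the cell.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table sits inside an HTML comment.
HTML = """
<div id="all_four_factors">
<!--
<table class="suppress_all stats_table" id="four_factors">
  <tbody>
    <tr><th data-stat="team_id">TEAM</th><td data-stat="off_rtg">102.5</td></tr>
  </tbody>
</table>
-->
</div>
"""


def select_in_comments(soup, selector):
    """Return the first match for `selector` inside any HTML comment, or None."""
    for comment in soup.find_all(string=lambda node: isinstance(node, Comment)):
        # Re-parse the comment body so its markup becomes real tags.
        match = BeautifulSoup(comment, "html.parser").select_one(selector)
        if match is not None:
            return match
    return None


soup = BeautifulSoup(HTML, "html.parser")
selector = 'table#four_factors td[data-stat="off_rtg"]'

direct = soup.select(selector)             # the table is not in the tag tree
cell = select_in_comments(soup, selector)  # found after parsing the comment

print(direct)
print(cell.get_text())
```

Parsing comments one at a time, instead of joining them all, stops at the first comment that contains a match and avoids building one large second document.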

Upvotes: 1
