GNMO11

Reputation: 2259

Beautiful Soup Pulling Data From Table

I am trying to pull the data from the Four Factors table on this page: https://www.basketball-reference.com/boxscores/201101100CHA.html. I am having trouble getting to the table. I have tried:

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

div = soup.find('div', id='all_four_factors')

When I then call tr = div.find_all('tr') to pull the rows, I get nothing back.

Upvotes: 0

Views: 61

Answers (3)

Bill M.

Reputation: 1548

I took a look at the HTML you're trying to scrape, and the problem is that the tags you want are all inside an HTML comment, <!-- like this -->. BeautifulSoup treats the contents of a comment as plain text, not as HTML elements. So you'll have to take the contents of the comment and feed that string back into BeautifulSoup:

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

div = soup.find('div', id='all_four_factors')

# Get everything in here that's a comment
comments = div.find_all(string=lambda text: isinstance(text, Comment))

# Loop through each comment until you find the one that
# has the stuff you want.
for c in comments:

    # A perhaps crude but effective way of stopping at a comment
    # with HTML inside: see if the first character inside is '<'.
    # startswith() also copes with an empty comment, where [0] would fail.
    if c.strip().startswith('<'):
        newsoup = BeautifulSoup(c.strip(), 'html.parser')
        tr = newsoup.find_all('tr')
        print(tr)

One caveat: BeautifulSoup will assume that the commented-out markup is valid, well-formed HTML. This works for me, though, so as long as the page structure stays roughly the same it should continue to work.
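To go one step further than printing the tr tags, here is a minimal sketch that turns the commented-out table into lists of cell text. It runs offline against a small made-up snippet rather than the real page: SAMPLE_HTML, its team names and numbers, and the helper name rows_from_commented_table are all assumptions for illustration, not something taken from the site.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table sits inside an HTML
# comment, next to an ordinary comment that holds no markup at all.
SAMPLE_HTML = """
<div id="all_four_factors">
  <!-- a plain note, not markup -->
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>90.1</td></tr>
    <tr><td>BBB</td><td>90.1</td></tr>
  </table>
  -->
</div>
"""


def rows_from_commented_table(html, div_id):
    """Return the cell text of every <tr> found inside comments of one div."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        return []

    rows = []
    for comment in div.find_all(string=lambda s: isinstance(s, Comment)):
        # Skip comments that clearly contain no table rows.
        if "<tr" not in comment:
            continue
        # Re-parse the comment body as HTML in its own soup.
        inner = BeautifulSoup(str(comment), "html.parser")
        for tr in inner.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            rows.append(cells)
    return rows


rows = rows_from_commented_table(SAMPLE_HTML, "all_four_factors")
print(rows)
```

Against the real page you would pass requests.get(url).content instead of SAMPLE_HTML; whether the live markup still matches this shape is something to check for yourself.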

Upvotes: 3

Kevin He

Reputation: 1250

If you look at list(div.children)[5], which is the only child that has tr as a substring, you'll see that it is a Comment object, so there is technically no tr element under that div node. That is why div.find_all('tr') is expected to be empty.
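You can see the same thing on a tiny example. This is only a sketch: the markup in SAMPLE_HTML and the id "wrapper" are made up, and the index of the comment among the children will differ from the real page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: one real <p> tag plus a table hidden in a comment.
SAMPLE_HTML = (
    '<div id="wrapper"><p>Heading</p>'
    "<!-- <table><tr><td>1</td></tr></table> --></div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", id="wrapper")

# Show what kind of object each direct child of the div is.
for child in div.children:
    print(type(child).__name__)

# Children that are comments, and the rows reachable as real tags.
commented = [c for c in div.children if isinstance(c, Comment)]
tag_rows = div.find_all("tr")

print(len(commented), len(tag_rows))
```

The comment's text contains "<tr>", but since it is a Comment and not a Tag, find_all('tr') has nothing to match.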

Upvotes: 2

Ahmed

Reputation: 120

Why are you doing:

div = soup.find('div',id='all_four_factors')

That selects only the following element and then searches for 'tr' tags inside it:

<div id="all_four_factors" class="table_wrapper floated setup_commented commented">

You can just use your original soup variable from the first part and do:

tr = soup.find_all('tr')

Upvotes: 0
