Reputation: 2259
I am trying to pull the data from the Four Factors table on this page: https://www.basketball-reference.com/boxscores/201101100CHA.html. I am having trouble getting to the table. I have tried
url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")
div = soup.find('div',id='all_four_factors')
Then when I try to use tr = div.find_all('tr') to pull the rows, I get nothing back.
Upvotes: 0
Views: 61
Reputation: 1548
I took a look at the HTML you're trying to scrape, and the problem is that the tags you're trying to get are all inside an HTML comment, <!-- like this -->. BeautifulSoup treats the contents of a comment as plain text, not as HTML. So what you'll have to do is take the contents of the comment and feed that string back into BeautifulSoup:
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.basketball-reference.com/boxscores/201101100CHA.html'
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")
div = soup.find('div', id='all_four_factors')
# Get everything in here that's a comment
comments = div.find_all(text=lambda text:isinstance(text, Comment))
# Loop through each comment until you find the one that
# has the stuff you want.
for c in comments:
    # A perhaps crude but effective way of stopping at a comment
    # with HTML inside: see if the first character inside is '<'.
    # startswith() also copes with an empty comment, where [0] would raise.
    if c.strip().startswith('<'):
        newsoup = BeautifulSoup(c.strip(), 'html.parser')
        tr = newsoup.find_all('tr')
        print(tr)
One caveat with this is that BS is going to assume that the commented-out code is valid, well-formed HTML. This works for me though, so if the page stays relatively the same it should continue to work.
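To go one step further than printing the raw tr tags, here is a minimal sketch that turns the commented-out table into a list of cell texts. The helper name rows_from_commented_table and the SAMPLE_HTML string are made up for illustration; the sample only imitates the page's structure (a wrapper div whose table lives inside a comment) so it can run without a network request, and its values are placeholders, not real stats.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table is inside an HTML comment.
SAMPLE_HTML = """
<div id="all_four_factors">
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>1.0</td></tr>
  </table>
  -->
</div>
"""


def rows_from_commented_table(html, wrapper_id):
    """Return each table row as a list of cell strings (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", id=wrapper_id)
    if wrapper is None:
        return []
    # Look only at the comment nodes under the wrapper div.
    for comment in wrapper.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text as HTML and look for a table in it.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table")
        if table is not None:
            return [
                [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")
            ]
    return []


rows = rows_from_commented_table(SAMPLE_HTML, "all_four_factors")
print(rows)
```

Against the live page you would pass requests.get(url).content instead of SAMPLE_HTML; the same caveat applies, since this assumes the comment holds well-formed HTML.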
Upvotes: 3
Reputation: 1250
If you look at list(div.children)[5], which is the only child that has tr as a substring in it, you'll realize that it is a Comment object, so there is technically no tr element under that div node. So div.find_all('tr') is expected to be empty.
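You can check this yourself without hitting the site. The snippet below is a small sketch using a made-up SAMPLE_HTML string that imitates the page's layout (the values are placeholders and the child index will differ from the real page); it shows that find_all('tr') finds nothing while the div does contain a Comment child whose text includes the table markup.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table is inside an HTML comment.
SAMPLE_HTML = """
<div id="all_four_factors">
  <div class="section_heading"><h2>Four Factors</h2></div>
  <!--
  <table id="four_factors">
    <tr><th>Team</th><th>Pace</th></tr>
    <tr><td>AAA</td><td>1.0</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", id="all_four_factors")

# No tr elements exist in the parsed tree under this div.
rows_in_div = div.find_all("tr")

# The table markup is only present as the text of a Comment node.
comment_children = [child for child in div.children if isinstance(child, Comment)]

print(len(rows_in_div))
print(len(comment_children))
print("<tr>" in comment_children[0])
```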
Upvotes: 2
Reputation: 120
Why are you doing:
div = soup.find('div',id='all_four_factors')
This would get the following line and try to search for 'tr' tags in it.
<div id="all_four_factors" class="table_wrapper floated setup_commented commented">
You can just use your original soup variable from the first part and do
tr = soup.find_all('tr')
Upvotes: 0