Reputation: 5115
FTR I have written quite a few scrapers successfully in both frameworks but I'm stumped. Here is a screenshot of the data I'm trying to scrape (you can also go to the actual link in the get request):
I attempt to target the div.section_content
:
import requests
from bs4 import BeautifulSoup
html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html)
soup.findAll("div", {"class": "section_content"})
Printing the last line shows some other divs, but not the one with the pitching data.
However, I can see it's in the text, so it's not a javascript triggered loading problem (the phrase "Pitching" only comes up in that table):
>>> "Pitching" in soup.text
True
Here is an abbreviated version of one of the golang attempts:
package main
import (
"fmt"
"github.com/gocolly/colly"
)
func main() {
c := colly.NewCollector(
colly.AllowedDomains("www.baseball-reference.com"),
)
c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
fmt.Println(e.ChildText("div.section_content"))
})
c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
} }
Upvotes: 1
Views: 2026
Reputation: 6967
It looks to me like the HTML is actually commented out, so that's why BeautifulSoup can't find it. Either remove the comment markers from the HTML string before you parse it or use BeautifulSoup to extract the comments and parse the return value.
For example:
for element in soup(text=lambda text: isinstance(text, Comment)):
comment = element.extract()
comment_soup = BeautifulSoup(comment)
# work with comment_soup
Upvotes: 4