Reputation: 51
I am writing a simple web scraper to extract the game times for NCAA basketball games. The code doesn't need to be pretty; it just needs to work. I have extracted values from other span tags on the same page, but for some reason I cannot get this one working.
from bs4 import BeautifulSoup as soup
import requests

url = 'http://www.espn.com/mens-college-basketball/game/_/id/401123420'
response = requests.get(url)
soupy = soup(response.content, 'html.parser')
containers = soupy.findAll("div", {"class": "team-container"})
for container in containers:
    spans = container.findAll("span")
    divs = container.find("div", {"class": "record"})
    ranks = spans[0].text
    team_name = spans[1].text
    team_mascot = spans[2].text
    team_abbr = spans[3].text
    team_record = divs.text
    time_container = soupy.find("span", {"class": "time game-time"})
    game_times = time_container.text
    refs_container = soupy.find("div", {"class": "game-info-note__container"})
    refs = refs_container.text
    print(ranks)
    print(team_name)
    print(team_mascot)
    print(team_abbr)
    print(team_record)
    print(game_times)
    print(refs)
The specific code I am concerned about is this:
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
I only provided the rest of the code to show that .text works on the other span tags. The time is the only data I truly want, but with my current code I just get an empty string.
This is the output I get when I print time_container:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>
or just '' when I evaluate game_times.
Here is the line of the HTML from the website:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true">6:10 PM CT</span>
I don't understand why the 6:10 pm is gone when I run the script.
Upvotes: 5
Views: 1830
Reputation: 84465
You can read the time from the data-date attribute on the page using only requests, without rendering the page:
import requests
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse
r = requests.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
soup = bs(r.content, 'lxml')
timing = soup.select_one('[data-date]')['data-date']
print(timing)
match_time = parse(timing).time()
print(match_time)
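If you want the time shown as a local clock time such as "6:10 PM", one option is to convert the timestamp yourself. This is only a sketch: it assumes the data-date value is an ISO 8601 UTC string ending in "Z", and the timestamp, the helper name to_local_time, and the fixed UTC-5 offset below are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_local_time(stamp, tz):
    """Convert an ISO 8601 UTC timestamp such as '2019-03-22T23:10Z' to a local clock time."""
    # fromisoformat() on older Pythons rejects a trailing 'Z', so spell out the offset.
    utc_dt = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    local_dt = utc_dt.astimezone(tz)
    # %I zero-pads the hour ('06:10 PM'); strip the pad to get '6:10 PM'.
    return local_dt.strftime("%I:%M %p").lstrip("0")


# Hypothetical timestamp and a fixed UTC-5 offset (US Central daylight time), for illustration only.
central_daylight = timezone(timedelta(hours=-5), "CDT")
print(to_local_time("2019-03-22T23:10Z", central_daylight))
```

A fixed offset ignores daylight saving changes. On Python 3.9+ you could pass zoneinfo.ZoneInfo("America/Chicago") as tz instead, provided time zone data is installed on the machine.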
Upvotes: 1
Reputation: 311
An alternative is to use one of ESPN's endpoints, which return JSON responses, for example: https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard
You can find other endpoints in this GitHub gist: https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b
This keeps your application lightweight compared to running Selenium.
I also recommend opening your browser's developer tools and going to the Network tab, where you can see every request the site makes.
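As a rough sketch of what using the scoreboard endpoint could look like, the snippet below separates the download from the parsing. The JSON layout is an assumption on my part (a top-level "events" list whose items have "name" and "date" keys), so check it against the real response first. The function names and the sample payload are hypothetical.

```python
import json
from urllib.request import urlopen

SCOREBOARD_URL = (
    "https://site.api.espn.com/apis/site/v2/sports/basketball/"
    "mens-college-basketball/scoreboard"
)


def fetch_scoreboard(url=SCOREBOARD_URL):
    """Download and decode the scoreboard JSON (performs a network request)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def game_times(payload):
    """Return (name, start time) pairs from a scoreboard-style payload.

    Assumes a top-level 'events' list whose items carry 'name' and 'date'
    keys; this layout is an assumption and may not match the live response.
    """
    return [(e.get("name"), e.get("date")) for e in payload.get("events", [])]


# Hypothetical sample payload, used so the sketch runs without network access.
sample = {"events": [{"name": "Team A at Team B", "date": "2019-03-22T23:10Z"}]}
for name, start in game_times(sample):
    print(name, start)
```

To try it against the live endpoint, call game_times(fetch_scoreboard()) and adjust the keys if the response is shaped differently.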
Upvotes: 2
Reputation: 71471
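The site is dynamic, so the span is still empty in the HTML that requests receives. To get the rendered text, you need a browser-driven tool such as selenium: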
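from bs4 import BeautifulSoup as soup
from selenium import webdriver
# Driver path passed positionally, as in the original answer; newer Selenium 4 releases expect a Service object instead.
d = webdriver.Chrome('/path/to/chromedriver')
d.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
game_time = soup(d.page_source, 'html.parser').find('span', {'class':'time game-time'}).text
d.quit()  # close the browser when done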
Output:
'7:10 PM ET'
See the full selenium documentation for more details.
Upvotes: 3