zezima

Reputation: 51

How to extract value from span tag

I am writing a simple web scraper to extract the game times for NCAA basketball games. The code doesn't need to be pretty, it just needs to work. I have extracted the values from other span tags on the same page, but for some reason I cannot get this one to work.

from bs4 import BeautifulSoup as soup
import requests

url = 'http://www.espn.com/mens-college-basketball/game/_/id/401123420'
response = requests.get(url)
soupy = soup(response.content, 'html.parser')

containers = soupy.findAll("div",{"class" : "team-container"})
for container in containers:
    spans = container.findAll("span")
    divs = container.find("div",{"class": "record"})
    ranks = spans[0].text
    team_name = spans[1].text
    team_mascot = spans[2].text
    team_abbr = spans[3].text
    team_record = divs.text
    time_container = soupy.find("span", {"class":"time game-time"})
    game_times = time_container.text
    refs_container = soupy.find("div", {"class" : "game-info-note__container"})
    refs = refs_container.text
    print(ranks)
    print(team_name)
    print(team_mascot)
    print(team_abbr)
    print(team_record)
    print(game_times)
    print(refs)

The specific code I am concerned about is this:

time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text

I only included the rest of the code to show that .text works on the other span tags. The time is the only data I really want, but with the code as it stands I just get an empty string.

This is what I get when I print time_container:

<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>

or just '' when I print game_times.

Here is the line of the HTML from the website:

<span class="time game-time" data-dateformat="time1" data-showtimezone="true">6:10 PM CT</span>

I don't understand why the "6:10 PM CT" text is missing when I run the script.

Upvotes: 5

Views: 1830

Answers (3)

QHarr

Reputation: 84465

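The start time is also stored in a data-date attribute in the static HTML, so you can read it with requests alone:

import requests
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse

r = requests.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
soup = bs(r.content, 'lxml')  # the 'lxml' parser requires the lxml package
# The first element carrying a data-date attribute holds the start time
timing = soup.select_one('[data-date]')['data-date']
print(timing)
# Keep only the time component, as given in the attribute (no time zone conversion)
match_time = parse(timing).time()
print(match_time)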
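
The time printed this way is whatever the attribute contains, not the "6:10 PM CT" text the browser shows. As a rough sketch of the conversion, assuming the data-date value is a UTC timestamp shaped like the hypothetical sample below (the sample value, the helper name format_game_time, and the fixed offsets are illustrative assumptions, not taken from the page):

```python
from datetime import datetime, timedelta, timezone

def format_game_time(data_date, tz):
    """Convert a UTC timestamp string to a display time in the given zone."""
    # Assumed input format: 'YYYY-MM-DDTHH:MMZ', where the trailing Z means UTC
    utc_dt = datetime.strptime(data_date, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)
    local_dt = utc_dt.astimezone(tz)
    # %I is zero-padded (06:10), so strip the leading zero for display
    return local_dt.strftime("%I:%M %p %Z").lstrip("0")

# Fixed offsets keep the sketch dependency-free; they ignore daylight saving rules
CENTRAL = timezone(timedelta(hours=-5), "CT")
EASTERN = timezone(timedelta(hours=-4), "ET")

sample = "2019-03-21T23:10Z"  # hypothetical data-date value
print(format_game_time(sample, CENTRAL))
print(format_game_time(sample, EASTERN))
```

For real use, zoneinfo.ZoneInfo("America/Chicago") from the standard library (Python 3.9+) would likely be a better choice than a fixed offset, since it follows daylight saving changes.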

Upvotes: 1

Jose Ortiz

Reputation: 311

An alternative is to use one of ESPN's endpoints, which return JSON responses, for example: https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard

Other endpoints are listed in this GitHub gist: https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b

This keeps your application much more lightweight than running Selenium.

I also recommend opening your browser's developer tools and watching the Network tab while the page loads. It lists every request the site makes, which is a good way to find endpoints like these.
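
A minimal sketch of reading game times from the scoreboard endpoint might look like the following. The key names ("events", "name", "date") and the sample payload are assumptions about the response shape, so check them against a real response first. The helper names are hypothetical, and urllib is used here only to keep the sketch free of third-party packages; requests.get(url).json() would work just as well.

```python
import json
from urllib.request import urlopen

SCOREBOARD_URL = ("https://site.api.espn.com/apis/site/v2/sports/basketball/"
                  "mens-college-basketball/scoreboard")

def fetch_scoreboard(url=SCOREBOARD_URL):
    """Download and decode the scoreboard JSON (needs network access)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)

def extract_game_times(payload):
    """Return (name, start time) pairs; the key names are assumptions."""
    return [(event.get("name"), event.get("date"))
            for event in payload.get("events", [])]

# Hypothetical payload in the assumed shape, so the sketch runs offline
SAMPLE = {"events": [{"name": "Team A at Team B", "date": "2019-03-21T23:10Z"}]}
print(extract_game_times(SAMPLE))
```

With network access, extract_game_times(fetch_scoreboard()) would apply the same extraction to the live response.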

Upvotes: 2

Ajax1234

Reputation: 71471

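The site is dynamic: the game time is filled in by JavaScript after the page loads, so requests only sees the empty span. You can render the page with Selenium instead:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

d = webdriver.Chrome('/path/to/chromedriver')  # Selenium 3 style; newer releases can locate the driver without a path
d.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
# page_source comes from the live browser DOM, so script-filled text is included once rendered
game_time = soup(d.page_source, 'html.parser').find('span', {'class':'time game-time'}).text
d.quit()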

Output:

'7:10 PM ET'

See the full Selenium documentation for details.

Upvotes: 3
