Reputation: 771
I'm trying to extract game stats of MLB games using BeautifulSoup. So far it's been working well, but I just noticed that I'm unable to retrieve the start time of the game the usual way:
soup.findAll("span", {"class": "time game-time"})
What's weird is that it finds the exact element and lets me print it, and the output shows that soup has found all the contents of the element except for the text. Unfortunately, the text is the only part I need.
URL in question: http://www.espn.com/mlb/game?gameId=370925110
Is there any way around this issue without having to use a webdriver like Selenium?
Code:
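import urllib.request
from bs4 import BeautifulSoup

link = "http://www.espn.com/mlb/game?gameId=370925110"
with urllib.request.urlopen(link) as url:
    page = url.read()
soup = BeautifulSoup(page, "html.parser")
clock = soup.findAll("span", {"class": "time game-time"})
print(clock[0])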
Upvotes: 3
Views: 336
Reputation: 1576
While normally you would have to do some reverse-engineering, no external API is consumed here to fill in the game time.
The timestamp of the game can be found as a variable in a script tag of the page source.
Plain BeautifulSoup suffices to get the timestamp:
js = str(soup.findAll("script", {"type": "text/javascript"}))
s = 'espn.gamepackage.timestamp = "'
idx = js.find(s) + len(s)
ts = ""
# Collect characters up to the closing quote.
while js[idx] != '"':
    ts += js[idx]
    idx += 1
print(ts)
# 2017-09-25T17:05Z
The timestamp is in UTC as indicated by the trailing Z.
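As a possible alternative to the character-by-character loop, the same value could be pulled out with a regular expression. This is only a sketch: `extract_timestamp` is a hypothetical helper, the `sample` string is a made-up stand-in for the real page source, and the pattern assumes the assignment appears as `espn.gamepackage.timestamp = "..."`.

```python
import re

# Assumes the page source contains: espn.gamepackage.timestamp = "..."
TIMESTAMP_RE = re.compile(r'espn\.gamepackage\.timestamp\s*=\s*"([^"]*)"')


def extract_timestamp(html):
    """Return the quoted timestamp string, or None if the assignment is absent."""
    match = TIMESTAMP_RE.search(html)
    return match.group(1) if match else None


# Made-up snippet standing in for the downloaded page source.
sample = (
    '<script type="text/javascript">'
    'espn.gamepackage.timestamp = "2017-09-25T17:05Z";'
    '</script>'
)
print(extract_timestamp(sample))
```

Unlike the loop, this returns None instead of reading garbage or raising an IndexError when the variable is missing, for example if the page layout changes.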
To convert to a different timezone you could use python-dateutil:
from datetime import datetime
from dateutil import tz
ts = datetime.strptime(ts, "%Y-%m-%dT%H:%MZ")
ts = ts.replace(tzinfo=tz.gettz('UTC'))
target_tz = ts.astimezone(tz.gettz('Europe/Berlin'))
print(target_tz)
(see Python - Convert UTC datetime string to local datetime)
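If installing python-dateutil is not an option, a standard-library-only variant might look like the sketch below. `parse_utc` is a hypothetical helper name, and the conversion target here is the machine's local timezone rather than a named zone such as Europe/Berlin.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse a 'YYYY-MM-DDTHH:MMZ' string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(ts, "%Y-%m-%dT%H:%MZ")
    return naive.replace(tzinfo=timezone.utc)


game_start = parse_utc("2017-09-25T17:05Z")
print(game_start.isoformat())

# astimezone() with no argument converts to the system's local timezone.
local_start = game_start.astimezone()
print(local_start)
```

The local time printed depends on the machine running the code, but both objects refer to the same instant.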
Upvotes: 3
Reputation: 10431
That is because this specific span tag is filled in by JavaScript.
If you want to see it for yourself, open the URL in your browser and view the page source to locate this span; you will see:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>
(or curl 'http://www.espn.com/mlb/game?gameId=370925110' | grep 'time game-time', whatever)
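The same check can be sketched in Python by parsing the static markup directly. The `static_html` string below is just the span copied from the page source shown above, not a live download; it illustrates that BeautifulSoup finds the element and its attributes but no text, because the text is only added later in the browser.

```python
from bs4 import BeautifulSoup

# The span exactly as it appears in the static page source.
static_html = (
    '<span class="time game-time" data-dateformat="time1" '
    'data-showtimezone="true"></span>'
)

soup = BeautifulSoup(static_html, "html.parser")
span = soup.find("span", {"class": "time game-time"})

# The element and its attributes are present, but the text is empty.
print(span.attrs)
print(repr(span.get_text()))
```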
So you have two solutions here:
selenium
Upvotes: 2