Reputation: 13
I am attempting to scrape data from a table located on the following webpage:
http://ontariohockeyleague.com/stats/players/60
Here's the code that I have written so far.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'http://ontariohockeyleague.com/stats/players/60'
#open webpage, read html, close webpage
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
The problem is that the table does not appear to be in the HTML that I download. When I inspect the webpage in the browser, the table sits inside this main block, but BeautifulSoup only sees an empty container.
page_soup.main
<main class="container">
<div class="container-content" data-feed_key="2976319eb44abe94" data-is-league="1" data-lang="en" data-league="ohl" data-league-code="" data-pagesize="100" data-season="63" id="stats"></div>
</main>
If I view the page source, it does not contain the table either, only the main block above. I have also tried other parsers with BeautifulSoup, and they return the same result.
How do I access the table?
Upvotes: 1
Views: 1666
Reputation: 195418
The network inspector shows that the page loads its data dynamically from http://lscluster.hockeytech.com/feed/ in JSON format. To obtain any data, a request needs the key from the main site. Here is an example (the data is stored in the variables seasons_data, teamsbyseason_data, and statviewtype_data):
import requests
from bs4 import BeautifulSoup
import json
url = "http://ontariohockeyleague.com/stats/players/60"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
seasons_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=seasons&key=%s&fmt=json&client_code=ohl&lang=en&league_code=&fmt=json"
teamsbyseason_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=teamsbyseason&key=%s&fmt=json&client_code=ohl&lang=en&season_id=60&league_code=&fmt=json"
statviewtype_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=statviewtype&type=topscorers&key=%s&fmt=json&client_code=ohl&lang=en&league_code=&season_id=60&first=0&limit=100&sort=active&stat=all&order_direction="
# the feed key is stored as a data attribute on the empty stats container
key = soup.find('div', id='stats')['data-feed_key']
# fetch each feed and decode the JSON response
r = requests.get(seasons_url % key)
seasons_data = r.json()
r = requests.get(teamsbyseason_url % key)
teamsbyseason_data = r.json()
r = requests.get(statviewtype_url % key)
statviewtype_data = r.json()
# print(json.dumps(seasons_data, indent=4, sort_keys=True))
# print(json.dumps(teamsbyseason_data, indent=4, sort_keys=True))
print(json.dumps(statviewtype_data, indent=4, sort_keys=True))
Prints:
{
    "SiteKit": {
        "Copyright": {
            "powered_by": "Powered by HockeyTech.com",
            "powered_by_url": "http://hockeytech.com",
            "required_copyright": "Official statistics provided by Ontario Hockey League",
            "required_link": "http://leaguestat.com"
        },
        "Parameters": {
            "client_code": "ohl",
            "feed": "modulekit",
            "first": "0",
            "fmt": "json",
            "key": "2976319eb44abe94",
            "lang": "en",
            "lang_id": 1,
            "league_code": "",
            "league_id": "1",
            "limit": "100",
            "order_direction": "",
            "season_id": 60,
            "sort": "active",
            "stat": "all",
            "team_id": 0,
            "type": "topscorers",
            "view": "statviewtype"
        },
... and so on...
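If the long format strings are hard to maintain, the feed URL can also be assembled from a dictionary of parameters. The following is only a rough sketch that runs offline against the main block quoted in the question and uses nothing but the standard library. StatsDivParser and build_topscorers_url are names made up for this illustration, and mapping data-league to client_code and data-pagesize to limit is my assumption based on the matching values, not something the feed documents.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# The <main> block as printed in the question (no network access needed here).
PAGE_SNIPPET = '''<main class="container">
<div class="container-content" data-feed_key="2976319eb44abe94" data-is-league="1" data-lang="en" data-league="ohl" data-league-code="" data-pagesize="100" data-season="63" id="stats"></div>
</main>'''

FEED_BASE = "http://lscluster.hockeytech.com/feed/"


class StatsDivParser(HTMLParser):
    """Hypothetical helper: collects the attributes of <div id="stats">."""

    def __init__(self):
        super().__init__()
        self.stats_attrs = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "div" and attr_map.get("id") == "stats":
            self.stats_attrs = attr_map


def build_topscorers_url(stats_attrs, season_id=60):
    """Hypothetical helper: rebuilds the statviewtype URL from the div attributes."""
    params = {
        "feed": "modulekit",
        "view": "statviewtype",
        "type": "topscorers",
        "key": stats_attrs["data-feed_key"],
        "fmt": "json",
        # Assumption: client_code and limit mirror these data attributes.
        "client_code": stats_attrs["data-league"],
        "lang": stats_attrs["data-lang"],
        "league_code": stats_attrs["data-league-code"],
        "season_id": season_id,
        "first": 0,
        "limit": stats_attrs["data-pagesize"],
        "sort": "active",
        "stat": "all",
        "order_direction": "",
    }
    return FEED_BASE + "?" + urlencode(params)


parser = StatsDivParser()
parser.feed(PAGE_SNIPPET)
feed_url = build_topscorers_url(parser.stats_attrs)
print(feed_url)
```

The resulting URL can be passed to requests.get in place of statviewtype_url % key. Changing season_id or the limit then only means changing one argument or dictionary entry.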
Upvotes: 2
Reputation: 73
The table is rendered using JavaScript, so it doesn't show up in the initial HTML that urllib downloads. You can either find the API the page uses and fetch the data from there, or use a headless browser to get the fully JavaScript-rendered HTML.
Upvotes: 2