Web scraping with BeautifulSoup: table not in page source

Question

I am attempting to scrape data from a table located on the following webpage:

http://ontariohockeyleague.com/stats/players/60

Here's the code that I have written so far.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://ontariohockeyleague.com/stats/players/60'

# open the webpage, read the HTML, close the connection
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# parse the HTML
page_soup = soup(page_html, "html.parser")

The problem is that, as far as I can tell, the table is not actually contained in the HTML. When I inspect the webpage, the table appears inside this main block, but BeautifulSoup does not see it.

page_soup.main

If I view the page source, it does not contain the table either, only the main block above. I have also tried other parsers with BeautifulSoup, and they return the same result.
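
One way to confirm that the table is missing from the HTML the server returns, rather than being dropped by a particular parser, is to count the table tags in the raw response. The following is a minimal sketch using only the standard library; TagCounter and count_tag are illustrative names, and the two HTML strings are made-up stand-ins for the served page and the rendered page, not the real content of the site.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return the number of <tag> start tags found in the raw HTML string."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts.get(tag, 0)


# Hypothetical stand-ins: what the server sends vs. what the browser shows
static_html = "<html><body><main><div id='stats'></div></main></body></html>"
rendered_html = "<html><body><main><table><tr><td>1</td></tr></table></main></body></html>"

print(count_tag(static_html, "table"))
print(count_tag(rendered_html, "table"))
```

If the count is zero for the downloaded HTML but the table is visible in the browser, the table is most likely being added by JavaScript after the page loads.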

How do I access the table?

Andrej Kesely · Accepted Answer

The network inspector shows that the table is loaded dynamically from http://lscluster.hockeytech.com/feed/ in JSON format. Every request to that feed needs a key, which is embedded in the main page. Here is an example (the data ends up in the variables seasons_data, teamsbyseason_data, and statviewtype_data):

import requests
from bs4 import BeautifulSoup
import json

url = "http://ontariohockeyleague.com/stats/players/60"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

seasons_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=seasons&key=%s&fmt=json&client_code=ohl&lang=en&league_code=&fmt=json"
teamsbyseason_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=teamsbyseason&key=%s&fmt=json&client_code=ohl&lang=en&season_id=60&league_code=&fmt=json"
statviewtype_url = "http://lscluster.hockeytech.com/feed/?feed=modulekit&view=statviewtype&type=topscorers&key=%s&fmt=json&client_code=ohl&lang=en&league_code=&season_id=60&first=0&limit=100&sort=active&stat=all&order_direction="

# The feed key is stored on the main page as a data attribute of the stats container
key = soup.find('div', id='stats')['data-feed_key']

# Fill the key into each feed URL and decode the JSON response
seasons_data = requests.get(seasons_url % key).json()
teamsbyseason_data = requests.get(teamsbyseason_url % key).json()
statviewtype_data = requests.get(statviewtype_url % key).json()

# print(json.dumps(seasons_data, indent=4, sort_keys=True))
# print(json.dumps(teamsbyseason_data, indent=4, sort_keys=True))
print(json.dumps(statviewtype_data, indent=4, sort_keys=True))

Prints:

{
    "SiteKit": {
        "Copyright": {
            "powered_by": "Powered by HockeyTech.com",
            "powered_by_url": "http://hockeytech.com",
            "required_copyright": "Official statistics provided by Ontario Hockey League",
            "required_link": "http://leaguestat.com"
        },
        "Parameters": {
            "client_code": "ohl",
            "feed": "modulekit",
            "first": "0",
            "fmt": "json",
            "key": "2976319eb44abe94",
            "lang": "en",
            "lang_id": 1,
            "league_code": "",
            "league_id": "1",
            "limit": "100",
            "order_direction": "",
            "season_id": 60,
            "sort": "active",
            "stat": "all",
            "team_id": 0,
            "type": "topscorers",
            "view": "statviewtype"
        },

... and so on...
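
Since the rest of the output is elided here, the exact location of the player rows inside the JSON is not shown. One way to find them without guessing the key names is to walk the parsed structure and report every list of dictionaries. This is only a sketch: find_record_lists is an illustrative helper, and in the sample payload only "SiteKit" and "Parameters" come from the output above, while "Rows" and its fields are hypothetical placeholders.

```python
def find_record_lists(node, path=""):
    """Yield (path, list) pairs for every non-empty list of dicts in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_record_lists(value, f"{path}/{key}")
    elif isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield path, node
        # Keep descending in case records contain nested lists of their own
        for index, item in enumerate(node):
            yield from find_record_lists(item, f"{path}[{index}]")


# Made-up payload: "Rows" and its fields are hypothetical, not the real feed layout
sample = {
    "SiteKit": {
        "Parameters": {"season_id": 60, "type": "topscorers"},
        "Rows": [
            {"name": "Player A", "points": 10},
            {"name": "Player B", "points": 7},
        ],
    }
}

for path, records in find_record_lists(sample):
    # Show where each list lives, how long it is, and which fields a record has
    print(path, len(records), sorted(records[0]))
```

Running the same loop over statviewtype_data should point to the list that holds the table rows, which can then be written out with the csv module or loaded into a DataFrame.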
