James Walters

Reputation: 1

Why can't I see/use the innerHTML of this site in my Python interpreter?

I'm working on a web-scraping project using BS4 in which I'm trying to aggregate college tuition data. I'm using the site tuitiontracker.org as a data source. Once I've navigated to a specific college, I want to scrape the tuition data off the page. When I inspect the element in the browser, I can see the tuition data as the element's innerHTML, but when I use Beautiful Soup to find it, it returns everything about that element except the tuition data.

Here is the url I'm trying to scrape from: https://www.tuitiontracker.org/school.html?unitid=164580

Here is the code I am using:

import urllib.request
from bs4 import BeautifulSoup


DOWNLOAD_URL = "https://www.tuitiontracker.org/school.html?unitid=164580"


def download_page(url):
    return urllib.request.urlopen(url)

# print(download_page(DOWNLOAD_URL).read())

def parse_html(html):
    """Gathers data from an HTML page"""
    soup = BeautifulSoup(html, features="html.parser")
    # print(soup.prettify())
    tuition_data = soup.find("div", attrs={"id": "price"}).innerHTML
    
    print(tuition_data)

def main():
    url = DOWNLOAD_URL
    parse_html(download_page(DOWNLOAD_URL).read())


if __name__ == "__main__":
    main()

When I print tuition_data, I see the relevant tags where the tuition data is stored on the page, but no numeric value. I've tried using .innerHTML and .string, but they end up printing either None or simply a blank space.

Really quite confused, thanks for any clarification.

Upvotes: 0

Views: 75

Answers (1)

baduker

Reputation: 20042

The data comes from an API endpoint and is rendered dynamically by JavaScript in the browser. BeautifulSoup only parses the HTML the server returns and doesn't run scripts, so you won't get the values with it.
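
To see why the tags come back empty, here's a minimal sketch. The HTML string is a hypothetical stand-in for what the server might send (an empty container that a script fills in later), not the site's real markup. It also shows that BeautifulSoup has no innerHTML attribute: an unknown attribute is treated as a search for a child tag with that name, which is why you get None.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw server response: the container exists,
# but it stays empty until a script fills it in inside a browser.
STATIC_HTML = """
<html><body>
  <div id="price"></div>
  <script>/* a browser would fetch JSON here and fill in #price */</script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
price_div = soup.find("div", id="price")

# Not a real BeautifulSoup attribute: this is shorthand for
# price_div.find("innerHTML"), and no such child tag exists.
inner_html_attr = price_div.innerHTML

# The closest equivalent of the browser's innerHTML is decode_contents().
inner_markup = price_div.decode_contents()

# The visible text of the element, with surrounding whitespace stripped.
text = price_div.get_text(strip=True)

print(repr(inner_html_attr), repr(inner_markup), repr(text))
```

The tag is found, but there is nothing inside it to extract, which matches what you're seeing.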

However, you can query the endpoint.

Here's how:

import requests

url = "https://www.tuitiontracker.org/school.html?unitid=164580"
api_endpoint = "https://www.tuitiontracker.org/data/school-data-09042019/"

# The unitid from the page URL is the name of the JSON file for that school.
unit_id = url.split("=")[-1]
response = requests.get(f"{api_endpoint}{unit_id}.json").json()

# Take the first entry of the yearly data.
tuition = response["yearly_data"][0]
print(
    round(tuition["price_instate_oncampus"], 2),
    round(tuition["avg_net_price_0_30000_titleiv_privateforprofit"], 2),
)

Output:

75099.8 30255.86
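
If you plan to loop over many schools, splitting on "=" gets fragile as soon as a URL has more than one query parameter. Here's a rough sketch of a sturdier way to build the JSON URL using only the standard library. The names DATA_BASE and build_data_url are my own, and the base path is simply the one used above; the dated folder name suggests it could change, so treat it as an assumption.

```python
from urllib.parse import parse_qs, urlparse

# Base path copied from the snippet above (assumption: it may change over time).
DATA_BASE = "https://www.tuitiontracker.org/data/school-data-09042019/"


def build_data_url(page_url: str) -> str:
    """Turn a school page URL into the URL of its JSON data file."""
    # parse_qs maps each query parameter to a list of its values.
    query = parse_qs(urlparse(page_url).query)
    unit_ids = query.get("unitid")
    if not unit_ids:
        raise ValueError(f"no unitid parameter in {page_url!r}")
    return f"{DATA_BASE}{unit_ids[0]}.json"


data_url = build_data_url(
    "https://www.tuitiontracker.org/school.html?unitid=164580"
)
print(data_url)
```

You can then pass data_url to requests.get() exactly as before.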

PS. There's a lot more in that JSON. Pro tip for future web-scraping endeavors: your favorite web browser's Developer Tools should be your best friend.


Upvotes: 2
