Reputation: 1
I'm working on a web-scraping project using BS4 in which I'm trying to aggregate college tuition data. I'm using the site tuitiontracker.org as a data source. Once I've navigated to a specific college's page, I want to scrape the tuition data from it. When I inspect the element in the browser, I can see the tuition data in its innerHTML, but when I use Beautiful Soup to find it, it returns everything about that location except the tuition data.
Here is the URL I'm trying to scrape from: https://www.tuitiontracker.org/school.html?unitid=164580
Here is the code I am using:
import urllib.request
from bs4 import BeautifulSoup

DOWNLOAD_URL = "https://www.tuitiontracker.org/school.html?unitid=164580"


def download_page(url):
    return urllib.request.urlopen(url)
    # print(download_page(DOWNLOAD_URL).read())


def parse_html(html):
    """Gathers data from an HTML page"""
    soup = BeautifulSoup(html, features="html.parser")
    # print(soup.prettify())
    tuition_data = soup.find("div", attrs={"id": "price"}).innerHTML
    print(tuition_data)


def main():
    url = DOWNLOAD_URL
    parse_html(download_page(DOWNLOAD_URL).read())


if __name__ == "__main__":
    main()
When I print tuition_data, I see the relevant tags where the tuition data is stored on the page, but no number value. I've tried using .innerHTML and .string but they end up printing either None, or simply a blank space.
Really quite confused, thanks for any clarification.
Upvotes: 0
Views: 75
Reputation: 20042
The data comes from an API endpoint and is rendered dynamically by JavaScript, so you won't get it with BeautifulSoup.
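To see why the question's code prints None or nothing, here is a minimal sketch. The HTML string is a hypothetical stand-in for what the server might send before any JavaScript runs (the real page's markup may differ), and it needs no network access. It also shows that a BeautifulSoup Tag has no innerHTML property: an unknown attribute is treated as a search for a child tag with that name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the static HTML, before JavaScript fills in the price.
STATIC_HTML = '<div id="price"><span class="amount"></span></div>'

soup = BeautifulSoup(STATIC_HTML, "html.parser")
price_div = soup.find("div", attrs={"id": "price"})

# Tag has no innerHTML attribute; this is shorthand for price_div.find("innerHTML"),
# which finds no such child tag and returns None.
inner_html_attr = price_div.innerHTML

# decode_contents() is the closest BeautifulSoup equivalent of innerHTML.
inner_markup = price_div.decode_contents()

# The tags are there, but they contain no text yet.
text = price_div.get_text(strip=True)

print(repr(inner_html_attr))
print(repr(inner_markup))
print(repr(text))
```

In this sketch the markup is present but empty, which matches the symptom described in the question: the tags show up, the number does not.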
However, you can query the endpoint.
Here's how:
import requests

url = "https://www.tuitiontracker.org/school.html?unitid=164580"
api_endpoint = "https://www.tuitiontracker.org/data/school-data-09042019/"

# The unit id is the value after "=" in the page URL; the JSON file is named after it.
response = requests.get(f"{api_endpoint}{url.split('=')[-1]}.json").json()
tuition = response["yearly_data"][0]
print(
    round(tuition["price_instate_oncampus"], 2),
    round(tuition["avg_net_price_0_30000_titleiv_privateforprofit"], 2),
)
Output:
75099.8 30255.86
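If you want to reuse this for other schools, the following is a rough sketch rather than a definitive version. The helper names endpoint_for and price_fields are my own, and the sample record is hand-written from the two values printed above plus one made-up key, so it is not the real JSON. It reads the unitid with urllib.parse instead of splitting on "=", which should hold up better if the URL ever gains extra query parameters.

```python
from urllib.parse import urlparse, parse_qs

API_BASE = "https://www.tuitiontracker.org/data/school-data-09042019/"


def endpoint_for(page_url):
    """Build the JSON URL for a school page from its unitid query parameter."""
    query = parse_qs(urlparse(page_url).query)
    unit_id = query["unitid"][0]
    return f"{API_BASE}{unit_id}.json"


def price_fields(record):
    """Return the sorted keys of a yearly record whose names contain 'price'."""
    return sorted(key for key in record if "price" in key)


# Hand-written sample: the two keys used above, plus one made-up key for contrast.
sample_record = {
    "price_instate_oncampus": 75099.8,
    "avg_net_price_0_30000_titleiv_privateforprofit": 30255.86,
    "hypothetical_other_field": None,
}

print(endpoint_for("https://www.tuitiontracker.org/school.html?unitid=164580"))
print(price_fields(sample_record))
```

With a real response you would pass response["yearly_data"][0] to price_fields to see which price columns are available.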
PS. There's a lot more in that JSON. Pro tip for future web-scraping endeavors: your favorite web browser's Developer Tools should be your best friend.
Here's what it looks like behind the scenes:
Upvotes: 2