Reputation: 85
I am trying to build an application that scrapes course information from a university's course catalogue and then constructs a few possible schedules a student could choose from. The catalogue URL doesn't change when a new course is searched for, which is why I am using Selenium to perform the search automatically and then Beautiful Soup to scrape the results. This is my first time using Beautiful Soup and Selenium, so apologies in advance if the solution is quite simple.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)
response = requests.get(URL);
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all("tbody")
print(table)
Currently, when I print(table)
it prints two objects. The first (first picture) has the general information regarding the course (information I don't need to scrape). The second object is empty. As far as I can tell there are only two tables on the website, both pictured below. The second one is the one I am interested in scraping, but for some reason the second element in table
is empty.
The information pictured below is what I am trying to scrape.
Output from print(table):
[<tbody>
\n
<tr>
<th scope="row">Hours</th>
<td id="courseCredits"></td>
</tr>
\n
<tr>
<th scope="row">Prerequisites</th>
<td id="coursePrereqs"></td>
</tr>
\n
<tr>
<th scope="row">Recommended</th>
<td id="courseRec"></td>
</tr>
\n
<tr>
<th scope="row">Offered</th>
<td id="courseOffered"></td>
</tr>
\n
<tr>
<th scope="row">Headers</th>
<td id="courseHeaders"></td>
</tr>
\n
<tr>
<th scope="row">Note</th>
<td id="courseNote"></td>
</tr>
\n
<tr>
<th scope="row">When\xa0Taught</th>
<td id="courseWhenTaught"></td>
</tr>
\n
</tbody>
,
<tbody></tbody>
]
Upvotes: 0
Views: 157
Reputation: 55012
Here's a technique for parsing tables like that (this continues from the driver session in your code, after the search has run):
from requests import get

# inject jQuery and the table-to-json plugin into the live page
for js in ["http://code.jquery.com/jquery-1.11.3.min.js", "https://cdn.jsdelivr.net/npm/[email protected]/lib/jquery.tabletojson.min.js"]:
    body = get(js).content.decode('utf8')
    driver.execute_script(body)

# run tableToJSON() in the browser and pull the result back into Python
data = driver.execute_script("return $('table#sectionTable').tableToJSON()")
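For context, the shape of data below is an assumption based on how the table-to-json plugin generally behaves, not something pulled from this page: tableToJSON() returns one dict per table row, keyed by the column header text, so once the result is back in Python you can consume it like plain dicts. The column names here are illustrative only:

```python
# Hypothetical shape of `data` from the snippet above: one dict per
# table row, keyed by header text. These column names are examples,
# not the real table's headers.
data = [
    {'Section': '001', 'Instructor': 'Decker, Brett E', 'Days': 'MWF'},
    {'Section': '002', 'Instructor': 'Decker, Brett E', 'Days': 'MWF'},
]

for row in data:
    print(f"{row['Section']}: {row['Days']} with {row['Instructor']}")
```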
Upvotes: 1
Reputation: 195613
I'll just leave this here in case you want a solution without Selenium, using only the requests
module:
import json
import requests

url_classes = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getClasses.php'
url_sections = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getSections.php'

data_classes = {
    'searchObject[yearterm]': '20195',
    'searchObject[dept_name]': 'C S',
    'searchObject[catalog_number]': '142',
    'sessionId': ''
}

data_sections = {
    'courseId': '',
    'sessionId': '',
    'yearterm': '20195',
    'no_outcomes': 'true'
}

classes = requests.post(url_classes, data=data_classes).json()
data_sections['courseId'] = next(iter(classes))
sections = requests.post(url_sections, data=data_sections).json()

# print(json.dumps(sections, indent=4)) # <-- uncomment this to see all data
# print(json.dumps(classes, indent=4))

for section in sections['sections']:
    print(section)
    print('-' * 80)
This prints all sections (but there's more data if you uncomment the print statements):
{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '001', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '001', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '0900', 'end_time': '0950', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '51', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------
{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '002', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '1000', 'end_time': '1050', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '34', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------
...and so on.
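Since the end goal is building candidate schedules, here is a hedged sketch of flattening one of these section dicts into a readable schedule row. The field names are taken from the printed output above, not from any documented API, and format_section is my own helper name:

```python
# Sketch: flatten a section dict (shape inferred from the printed
# output above) into a compact one-line schedule entry.
def format_section(section):
    days_keys = ('mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun')
    meetings = []
    for t in section.get('times', []):
        # join the non-empty day flags, e.g. 'M' + 'W' + 'F' -> 'MWF'
        days = ''.join(t[d] for d in days_keys if t.get(d))
        meetings.append(f"{days} {t['begin_time']}-{t['end_time']} "
                        f"{t['building']} {t['room']}")
    instructors = ', '.join(i['sort_name'] for i in section.get('instructors', []))
    return (f"{section['dept_name']} {section['catalog_number']} "
            f"sec {section['section_number']}: {'; '.join(meetings)} ({instructors})")

# Example input, copied (and trimmed) from the first printed section:
sample = {
    'dept_name': 'C S', 'catalog_number': '142', 'section_number': '001',
    'instructors': [{'sort_name': 'Decker, Brett E'}],
    'times': [{'begin_time': '0900', 'end_time': '0950',
               'building': 'TMCB', 'room': '1170',
               'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '',
               'fri': 'F', 'sat': '', 'sun': ''}],
}

print(format_section(sample))
# -> C S 142 sec 001: MWF 0900-0950 TMCB 1170 (Decker, Brett E)
```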
Upvotes: 0
Reputation: 5909
This is pretty easy with just Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)
# wait for the sections table to be present
table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@id='sectionTable']")))

# iterate rows and cells (note the leading dot in ".//tr", which
# restricts the search to rows inside this table rather than the
# whole document)
rows = table.find_elements_by_xpath(".//tr")
for row in rows:
    # get the cells in this row
    cells = row.find_elements_by_tag_name("td")
    for cell in cells:
        print(cell.text)
Hopefully this gets you started.
Upvotes: 0