Thomas S.

Reputation: 85

Scraping data from a table using BeautifulSoup and Selenium

I am trying to build an application that scrapes course information from a university's course catalogue and then constructs a few possible schedules a student could choose. The catalogue URL doesn't change when a new course is searched for, which is why I am using Selenium to perform the search automatically and then Beautiful Soup to scrape the results. This is my first time using Beautiful Soup and Selenium, so apologies in advance if the solution is quite simple.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests

URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)
response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all("tbody")
print(table)

Currently, print(table) prints two objects. The first (pictured below) holds the general information about the course, which I don't need to scrape. The second object is empty. As far as I can tell there are only two tables on the website, both pictured below. The second is the one I am interested in scraping, but for some reason the second element in table is empty.
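For reference, this is roughly how I expected to read the rows once the table is populated — a sketch against a static snippet (the ids match the page's markup; the cell values are invented for illustration):

```python
from bs4 import BeautifulSoup

# Static snippet standing in for the populated page; the real page fills
# these cells with JavaScript after the search runs.
html = """
<table><tbody>
  <tr><th scope="row">Hours</th><td id="courseCredits">3.00</td></tr>
  <tr><th scope="row">Prerequisites</th><td id="coursePrereqs">None</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each row's header cell to its data cell
info = {row.th.get_text(strip=True): row.td.get_text(strip=True)
        for row in soup.tbody.find_all("tr")}
print(info)  # {'Hours': '3.00', 'Prerequisites': 'None'}
```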


The information pictured below is what I am trying to scrape.

Output from print(table):

[<tbody>
   <tr>
      <th scope="row">Hours</th>
      <td id="courseCredits"></td>
   </tr>
   <tr>
      <th scope="row">Prerequisites</th>
      <td id="coursePrereqs"></td>
   </tr>
   <tr>
      <th scope="row">Recommended</th>
      <td id="courseRec"></td>
   </tr>
   <tr>
      <th scope="row">Offered</th>
      <td id="courseOffered"></td>
   </tr>
   <tr>
      <th scope="row">Headers</th>
      <td id="courseHeaders"></td>
   </tr>
   <tr>
      <th scope="row">Note</th>
      <td id="courseNote"></td>
   </tr>
   <tr>
      <th scope="row">When Taught</th>
      <td id="courseWhenTaught"></td>
   </tr>
</tbody>,
<tbody></tbody>]

Upvotes: 0

Views: 157

Answers (3)

pguardiario

Reputation: 55012

Here's a technique for parsing tables like that:

from requests import get

# assumes `driver` (from the question) has already loaded the results page;
# inject jQuery and the table-to-json plugin into the page
for js in ["http://code.jquery.com/jquery-1.11.3.min.js",
           "https://cdn.jsdelivr.net/npm/table-to-json/lib/jquery.tabletojson.min.js"]:
    body = get(js).content.decode('utf8')
    driver.execute_script(body)

data = driver.execute_script("return $('table#sectionTable').tableToJSON()")

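tableToJSON hands back one dict per row, keyed by column header, so the result drops straight into the json module for storage — a sketch with made-up rows (the real keys depend on the table's headers):

```python
import json

# Illustrative shape of what tableToJSON returns; actual keys come from
# the section table's header row.
data = [
    {"Section": "001", "Instructor": "Decker B", "Days": "MWF"},
    {"Section": "002", "Instructor": "Decker B", "Days": "MWF"},
]

# Persist the scraped rows for the schedule-building step
with open("sections.json", "w") as f:
    json.dump(data, f, indent=2)
```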

Upvotes: 1

Andrej Kesely

Reputation: 195613

I'll just leave this here in case you want a solution without Selenium, using only the requests module:

import json
import requests

url_classes = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getClasses.php'
url_sections = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getSections.php'

data_classes = {
    'searchObject[yearterm]':'20195',
    'searchObject[dept_name]':'C S',
    'searchObject[catalog_number]':'142',
    'sessionId':''
}

data_sections = {
    'courseId':'',
    'sessionId':'',
    'yearterm':'20195',
    'no_outcomes':'true'
}

classes = requests.post(url_classes, data=data_classes).json()
data_sections['courseId'] = next(iter(classes))
sections = requests.post(url_sections, data=data_sections).json()

# print(json.dumps(sections, indent=4)) # <-- uncomment this to see all data
# print(json.dumps(classes, indent=4))

for section in sections['sections']:
    print(section)
    print('-' * 80)

This prints all sections (but there's more data if you uncomment the print statements):

{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '001', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '001', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '0900', 'end_time': '0950', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '51', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------
{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '002', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '1000', 'end_time': '1050', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '34', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------

...and so on.
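Since the end goal is building schedules, the section dicts above can be reduced to compact schedule lines; a minimal sketch (the helper name is mine, the keys are taken from the output above):

```python
def format_section(section):
    """Build a one-line summary from a section dict as returned by getSections.php."""
    slot = section['times'][0] if section['times'] else {}
    # Day columns hold their letter when the class meets, '' otherwise
    days = ''.join(slot.get(d, '') for d in ('mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'))
    instructors = ', '.join(i['sort_name'] for i in section.get('instructors', []))
    return (f"{section['dept_name']} {section['catalog_number']}-{section['section_number']} "
            f"{days} {slot.get('begin_time', '')}-{slot.get('end_time', '')} "
            f"{slot.get('building', '')} {slot.get('room', '')} ({instructors})")

# Minimal section dict with only the keys used above (values from the output above)
sample = {
    'dept_name': 'C S', 'catalog_number': '142', 'section_number': '001',
    'instructors': [{'sort_name': 'Decker, Brett E'}],
    'times': [{'begin_time': '0900', 'end_time': '0950', 'building': 'TMCB',
               'room': '1170', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '',
               'fri': 'F', 'sat': '', 'sun': ''}],
}
print(format_section(sample))  # C S 142-001 MWF 0900-0950 TMCB 1170 (Decker, Brett E)
```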

Upvotes: 0

CEH

Reputation: 5909

This is pretty easy with just Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)

# get table
table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@id='sectionTable']")))

# iterate rows and cells (".//tr" keeps the search relative to this table;
# a leading "//" would match every row in the document)
rows = table.find_elements_by_xpath(".//tr")
for row in rows:

    # get cells
    cells = row.find_elements_by_tag_name("td")

    # iterate cells
    for cell in cells:
        print(cell.text)

Hopefully this gets you started.

Upvotes: 0
