Aaron Khong

Reputation: 23

Issue with scraping urls from dynamic webpage with BeautifulSoup and Selenium

I'm web scraping for the first time and I'm having trouble scraping a list of URLs from a website. It works fine on Colaboratory when I replace the specified path with /usr/lib/chromium-browser/chromedriver, but when I try this code in my IDE....

Upvotes: 0

Views: 152

Answers (1)

baduker

Reputation: 20052

Just run Chrome in headed mode, with a visible browser window. In other words, don't pass the --headless option.

from bs4 import BeautifulSoup
from selenium import webdriver

# Run Chrome with a visible window: note that --headless is NOT added here.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome(options=options)

courses = []
for i in range(1, 2):  # page 1 only; widen the range to scrape more pages
    wd.get(f"https://www.sydney.edu.au/courses/search.html?search-type=course&page={i}")
    html_soup = BeautifulSoup(wd.page_source, "lxml")
    # Each search result is an <a> tag whose class attribute is exactly this string.
    for x in html_soup.find_all("a", class_="b-result-container__item-wrapper b-result-container__item-wrapper--data b-link--no-underline"):
        courses.append(x.get("href"))

wd.quit()  # close the browser once scraping is done

for x in courses:
    print(x)

Output:

https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-commerce.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-economics.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-psychology0.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-pharmacy.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-music.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts-honours.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-advanced-computing.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-oral-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-visual-arts.html
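
Note that the loop above only requests page 1, because range(1, 2) yields a single value. If you want more pages, here is a minimal sketch of building the search URLs up front. The helper name search_page_urls is hypothetical, and how many result pages the site actually has is an assumption you would need to check yourself.

```python
from urllib.parse import urlencode

# Base search URL taken from the code above.
BASE_URL = "https://www.sydney.edu.au/courses/search.html"


def search_page_urls(first_page: int, last_page: int) -> list[str]:
    """Return the search URLs for pages first_page..last_page, inclusive.

    Hypothetical helper: it only builds strings and does not check
    that the pages exist.
    """
    return [
        f"{BASE_URL}?{urlencode({'search-type': 'course', 'page': page})}"
        for page in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    for url in search_page_urls(1, 3):
        print(url)
```

Each URL can then be passed to wd.get() in place of the f-string inside the loop.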

You get this error because headless Chrome announces itself as HeadlessChrome/89.0.4389.90 in its User-Agent header. It shows up in the logged output:

darkorange", source: https://www.sydney.edu.au/etc.clientlibs/courses/clientlibs/frontend-js.js (11714)
[0323/232203.250:INFO:CONSOLE(3)] "Hotjar not launching due to suspicious userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36", source: https://static.hotjar.com/c/hotjar-550296.js?sv=6 (3)
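
If you want to confirm which User-Agent your own browser session sends, a rough sketch like the one below may help. The function names looks_headless and current_user_agent are hypothetical. The second one assumes you pass it a live Selenium driver such as wd from the code above, and it relies only on the driver's execute_script method.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the User-Agent string advertises headless Chrome."""
    return "HeadlessChrome" in user_agent


def current_user_agent(driver) -> str:
    """Ask the browser for its User-Agent via JavaScript.

    Assumes `driver` is a running Selenium WebDriver instance.
    """
    return driver.execute_script("return navigator.userAgent")


if __name__ == "__main__":
    # User-Agent string copied from the log output above.
    ua = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36"
    )
    print(looks_headless(ua))
```

With a driver running, looks_headless(current_user_agent(wd)) should tell you whether the session is identifying itself as headless.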

Upvotes: 1
