Reputation: 23
I'm web scraping for the first time and I'm having trouble scraping a list of URLs from a website. It works fine on Colaboratory when I replace the specified path with /usr/lib/chromium-browser/chromedriver, but when I try this code in my IDE....
Upvotes: 0
Views: 152
Reputation: 20052
Just use Chrome in headed mode. In other words, don't use headless mode.
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# No '--headless' argument here, so Chrome opens a visible window.
wd = webdriver.Chrome(options=options)

courses = []
for i in range(1, 2):  # range(1, 2) fetches page 1 only
    wd.get(f"https://www.sydney.edu.au/courses/search.html?search-type=course&page={i}")
    html_soup = BeautifulSoup(wd.page_source, "lxml")
    # Collect the href of every course result link on the page.
    for x in html_soup.find_all("a", class_="b-result-container__item-wrapper b-result-container__item-wrapper--data b-link--no-underline"):
        courses.append(x.get("href"))
wd.quit()  # close the browser once the pages are scraped

for x in courses:
    print(x)
Output:
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-commerce.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-economics.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-psychology0.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-pharmacy.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-music.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts-honours.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-advanced-computing.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-oral-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-visual-arts.html
You get this error because of the HeadlessChrome/89.0.4389.90 token in the User-Agent header. It's visible in the console output:
darkorange", source: https://www.sydney.edu.au/etc.clientlibs/courses/clientlibs/frontend-js.js (11714)
[0323/232203.250:INFO:CONSOLE(3)] "Hotjar not launching due to suspicious userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36", source: https://static.hotjar.com/c/hotjar-550296.js?sv=6 (3)
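To make the difference concrete, here is a minimal sketch that inspects the User-Agent string from the console log above. It only uses the standard library. The helper names is_headless_ua and as_regular_chrome_ua are hypothetical, made up for this illustration, and are not part of Selenium or any other library.

```python
import re

# User-Agent string copied from the console log above.
HEADLESS_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36"
)


def is_headless_ua(user_agent: str) -> bool:
    """Hypothetical helper: True if the UA advertises headless Chrome."""
    return "HeadlessChrome/" in user_agent


def as_regular_chrome_ua(user_agent: str) -> str:
    """Hypothetical helper: rewrite the HeadlessChrome token to Chrome."""
    return re.sub(r"\bHeadlessChrome/", "Chrome/", user_agent)


spoofed_ua = as_regular_chrome_ua(HEADLESS_UA)
print(is_headless_ua(HEADLESS_UA))  # True
print(is_headless_ua(spoofed_ua))   # False
print(spoofed_ua)
```

If you really need headless mode, one thing you could try is passing the rewritten string to Chrome with options.add_argument(f"--user-agent={spoofed_ua}") before creating the driver. That is an assumption on my part: I haven't verified that this site returns the course list once the User-Agent is changed, so running Chrome with a visible window remains the approach I actually tested.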
Upvotes: 1