Reputation: 93
I'm trying to scrape the links from the careers page on a college website, and I am getting this error.
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Moved Temporarily
I think this is because the site has a session cookie. After doing a bit of reading, there seem to be many ways to get around this (Requests, http.cookiejar, Selenium/PhantomJS), but I don't know how to incorporate these solutions into my scraping program.
This is my scraping program. It's written in Python 3.6 with BeautifulSoup4.
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []
for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')
When I clear the cookies in my browser and manually go to the page I'm trying to scrape (https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp), I'm taken to a different page. Once I have the cookie though, I can go directly to the SearchResults page that I want to scrape.
This is the cookie:
Any thoughts on how I can deal with this cookie?
Upvotes: 1
Views: 7510
Reputation: 46779
The website you are trying to access is probably testing for both cookies and JavaScript. Python does provide cookie handling via the http.cookiejar library, but that alone will not be enough if JavaScript is also mandatory.
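For reference, the cookie-only approach with the standard library would look roughly like this. It is a sketch, reusing the SearchResults URL from the question; if the site sets its cookie via JavaScript, this will still fail:

```python
import http.cookiejar
import urllib.request

# URL from the question
url = "https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp"

# The CookieJar keeps any Set-Cookie headers the server sends, so the
# redirect back to the search page carries the session cookie along
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

try:
    html = opener.open(url, timeout=10).read()
except Exception:
    # still fails if the cookie is only set by JavaScript in the browser
    html = None
```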
Instead you could use Selenium to get the HTML. It drives a real browser by remote control, so both cookies and JavaScript are handled for you. It can be used as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://jobs.fanshawec.ca/applicants/Central?delegateParameter=searchDelegate&actionParameter=showSearch&searchType=8192"
browser = webdriver.Firefox()   # assumes geckodriver/Firefox is installed and on your PATH
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
data = soup.select(".ft0 a")
ads = []
for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
(Also look at PhantomJS for a headless solution)
Which would give you your links starting as follows:
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174604&c=%2BWIX1RV817HeJUg7cnxxnQ%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174585&c=4E7TSRVJx7jLG39iR7HvMw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174563&c=EyCIe7a8xt0a%2BLp4xqtzaw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174566&c=coZCMU3091mmz%2BE7p%2BHNIg%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants
Note: To use Selenium, you will need to install it, as it is not part of the default Python libraries:
pip install selenium
Upvotes: 0
Reputation: 31
Using the requests module:
from bs4 import BeautifulSoup
import requests
session = requests.Session()
req = session.get("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
req.raise_for_status()  # omit this if you don't want an exception on a non-200 response
html = req.text
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []
for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')
However, I am not getting any output, which is probably because ads ends up empty. I hope this helps you.
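One way to see what actually happened is to inspect the redirect chain the session followed and the cookies it collected along the way; a sketch against the same URL:

```python
import requests

url = "https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp"
session = requests.Session()

try:
    req = session.get(url, timeout=10)
    # req.history lists the intermediate redirect responses the session followed
    for resp in req.history:
        print(resp.status_code, resp.url)
    # cookies the server set during the redirects
    print(session.cookies.get_dict())
except requests.RequestException as exc:
    print("request failed:", exc)
```

If the cookie dict is empty and there are no redirects in the history, the session cookie is probably being set by JavaScript, which requests cannot execute.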
Upvotes: 1