nasan

Reputation: 93

Handling Cookies while scraping with Python

I'm trying to scrape the links from the careers page on a college website, and I am getting this error.

urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Moved Temporarily

I think this is because the site uses a session cookie. After doing a bit of reading, there seem to be several ways to get around this (Requests, http.cookiejar, Selenium/PhantomJS), but I don't know how to incorporate these solutions into my scraping program.

This is my scraping program. It's written in Python 3.6 with BeautifulSoup4.

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')

When I clear the cookies in my browser and manually go to the page I'm trying to scrape (https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp), I'm taken to a different page. Once I have the cookie though, I can go directly to the SearchResults page that I want to scrape.

This is the cookie:

(screenshot of the session cookie, not reproduced here)

Any thoughts on how I can deal with this cookie?

Upvotes: 1

Views: 7510

Answers (2)

Martin Evans

Reputation: 46779

The website you are trying to access is probably testing for both cookies and JavaScript to be present. Python does provide cookie handling via the http.cookiejar module, but that alone will not be enough if JavaScript is also mandatory.
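If cookies do turn out to be the only requirement, http.cookiejar can be wired straight into urllib. A minimal sketch of the mechanism, using a throwaway local server in place of the college site (the handler and the "session" cookie name are made up for the demo; the server just imitates "redirect until you present the cookie"):

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import build_opener, HTTPCookieProcessor

class Handler(BaseHTTPRequestHandler):
    # Mimic the site: no cookie -> 302 with Set-Cookie; cookie -> the page.
    def do_GET(self):
        if "session=abc" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"search results")
        else:
            self.send_response(302)
            self.send_header("Set-Cookie", "session=abc")
            self.send_header("Location", self.path)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

jar = CookieJar()
# An opener built with HTTPCookieProcessor stores Set-Cookie headers in the
# jar and replays them on the redirected request, so the loop resolves.
opener = build_opener(HTTPCookieProcessor(jar))
body = opener.open("http://127.0.0.1:%d/search" % server.server_port).read()
print(body)      # b'search results'
print(len(jar))  # 1 -- the session cookie was captured
server.shutdown()
```

With a plain urlopen() the same server would loop forever, which is exactly the HTTP Error 302 in the question.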

Instead you could use Selenium to get the HTML. It is a bit like a remote control for an existing browser, and can be used as follows:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

url = "https://jobs.fanshawec.ca/applicants/Central?delegateParameter=searchDelegate&actionParameter=showSearch&searchType=8192"

browser = webdriver.Firefox(firefox_binary=FirefoxBinary())
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')

data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)

(Also look at PhantomJS for a headless solution)

Which would give you your links starting as follows:

/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174604&c=%2BWIX1RV817HeJUg7cnxxnQ%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174585&c=4E7TSRVJx7jLG39iR7HvMw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174563&c=EyCIe7a8xt0a%2BLp4xqtzaw%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants/Central?delegateParameter=applicantPostingSearchDelegate&actionParameter=getJobDetail&rowId=174566&c=coZCMU3091mmz%2BE7p%2BHNIg%3D%3D&pageLoadIdRequestKey=1490116459459&functionalityTableName=8192&windowTimestamp=null
/applicants

Note: To use Selenium, you will need to install it, as it is not part of the default Python libraries:

pip install selenium

Upvotes: 0

bennr01

Reputation: 31

Using the requests-module:

from bs4 import BeautifulSoup
import requests

session = requests.Session()
req = session.get("https://jobs.fanshawec.ca/applicants/jsp/shared/search/SearchResults_css.jsp")
req.raise_for_status()  # omit this if you don't want an exception on a non-200 response
html = req.text
soup = BeautifulSoup(html, 'html.parser')
data = soup.select(".ft0 a")
ads = []

for i in data:
    link = i.get('href')
    ads.append(link)

for job in ads:
    print(job)
    print('')

However, I am not getting any output, which is probably because ads is empty. I hope this helps you.
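As a sanity check that the Session really is capturing the cookie, you can inspect the redirect chain via resp.history and the jar via session.cookies. A sketch against a throwaway local server that imitates the site's redirect-then-cookie behaviour (the handler and the "session" cookie name are invented for the demo):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class Handler(BaseHTTPRequestHandler):
    # Mimic the site: no cookie -> 302 with Set-Cookie; cookie -> the page.
    def do_GET(self):
        if "session=abc" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"search results")
        else:
            self.send_response(302)
            self.send_header("Set-Cookie", "session=abc")
            self.send_header("Location", self.path)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
resp = session.get("http://127.0.0.1:%d/search" % server.server_port)
print(resp.history[0].status_code)     # 302 -- the redirect that set the cookie
print(session.cookies.get("session"))  # abc
print(resp.text)                       # search results
server.shutdown()
```

If resp.history shows the 302 and the cookie is in session.cookies but the real page still has no ".ft0 a" links, the content is most likely rendered by JavaScript, and a browser-driven approach like the Selenium answer above is needed.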

Upvotes: 1
