baobobs

Reputation: 703

BeautifulSoup for job searching

I would like to search for jobs in my field in batch mode using BeautifulSoup. I have a list of URLs, each an employer's career page. If the keyword GIS appears in a job title, I want the script to return the link to that job posting.

Here are two example cases:

The first company site required a keyword search. This page is the result:

https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20

I would like it to return the following:

https://jobs-challp.icims.com/jobs/2432/gis-specialist/job

https://jobs-challp.icims.com/jobs/2369/gis-specialist/job

The second site did not require a keyword search:

https://www.smartrecruiters.com/SpectraForce1/

I would like it to return the following:

https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist

https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician

This is as far as I can get:

from bs4 import BeautifulSoup
import urllib2

url = 'https://www.smartrecruiters.com/SpectraForce1/'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)

text = soup.get_text()

if 'GIS ' in text:
    print 'Job Found!'

There are two problems: 1) This confirms that a job was found, but does not return the link to the job itself. 2) For the first company site, this method did not find the two relevant positions. I checked by scanning the output of soup.get_text() and saw that the returned text did not contain the job titles.

Any help or additional suggestions would be appreciated.

Thanks!

Upvotes: 0

Views: 590

Answers (3)

user2629998

Here's my attempt, though it's pretty much the same as the other answers:

from bs4 import BeautifulSoup
from urllib2 import urlopen

def work(url):
    soup = BeautifulSoup(urlopen(url).read())

    for i in soup.findAll("a", text=True):
        if "GIS" in i.text:
            print "Found link "+i["href"].replace("?in_iframe=1", "")

urls = ["https://jobs-challp.icims.com/jobs/search?pr=0&searchKeyword=gis&searchRadius=20&in_iframe=1", "https://www.smartrecruiters.com/SpectraForce1/"]

for i in urls:
    work(i)

It defines a function work() that does the actual work. It fetches the page from the remote server with urlopen() (it looked like you wanted to use urllib2, though I suggest Python-Requests), then finds all a elements (links) that have text using findAll(). For each link, it checks whether "GIS" is in the link's text and, if so, prints the link's href attribute with the ?in_iframe=1 suffix removed.

Then it defines the list of URLs (just two in this case) as a plain list literal and calls work() once for each URL, passing the URL as the argument.
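The same idea can be sketched for Python 3 so that matches are returned rather than printed. This is only an illustration, not the code above: find_job_links and the sample markup are hypothetical, the "html.parser" backend is chosen to avoid extra dependencies, and urljoin is added on the assumption that some career pages use relative links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_job_links(html, base_url, keyword="GIS"):
    """Return absolute URLs of links whose visible text contains keyword."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for a in soup.find_all("a", href=True):
        if keyword in a.get_text():
            # urljoin leaves absolute hrefs untouched and resolves relative ones
            found.append(urljoin(base_url, a["href"]))
    return found


# Hypothetical markup standing in for a fetched career page
sample_html = """
<a href="/jobs/1/gis-specialist">GIS Specialist</a>
<a href="/jobs/2/accountant">Accountant</a>
<a href="https://example.com/jobs/3/gis-technician">GIS Technician</a>
"""

links = find_job_links(sample_html, "https://example.com/careers/")
for link in links:
    print(link)
```

For a live page, the html argument could be the bytes returned by urllib.request.urlopen(url).read(), with the same url passed as base_url.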

Upvotes: 1

Steinar Lima

Reputation: 7821

Here's a go!

This code finds all links whose text contains 'GIS'. I needed to add &in_iframe=1 to the first URL to make it work.

import urllib2
from bs4 import BeautifulSoup

urls = ['https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20&in_iframe=1',
        'https://www.smartrecruiters.com/SpectraForce1/']

for url in urls:
    soup = BeautifulSoup(urllib2.urlopen(url))
    print 'Scraping {}'.format(url)
    for link in soup.find_all('a'):
        if 'GIS' in link.text:
            print '--> TEXT: ' + link.text.strip()
            print '--> URL:  ' + link['href']
            print ''

Output:

Scraping https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20&in_iframe=1
--> TEXT: GIS Specialist
--> URL:  https://jobs-challp.icims.com/jobs/2432/gis-specialist/job?in_iframe=1

--> TEXT: GIS Specialist
--> URL:  https://jobs-challp.icims.com/jobs/2369/gis-specialist/job?in_iframe=1

Scraping https://www.smartrecruiters.com/SpectraForce1/
--> TEXT: Technical Specialist/ Research Analyst/ GIS/ Engineering Technician
--> URL:  https://www.smartrecruiters.com/SpectraForce1/74985505-technical-specialist

--> TEXT: GIS Specialist
--> URL:  https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist

--> TEXT: GIS Technician
--> URL:  https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician
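The iCIMS links in this output still carry ?in_iframe=1, whereas the question asks for the plain job URLs. One way to strip that parameter, sketched here for Python 3 with only the standard library, is to rebuild the URL without it. The helper name drop_query_param is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name="in_iframe"):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves parameters such as "searchZip="
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = drop_query_param(
    "https://jobs-challp.icims.com/jobs/2432/gis-specialist/job?in_iframe=1"
)
print(cleaned)
```

Unlike a plain string replace, this also handles the case where in_iframe is not the only parameter in the query string.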

Upvotes: 1

Serial

Reputation: 8043

Here is one way:

from bs4 import BeautifulSoup
import urllib2
import re

url = 'https://www.smartrecruiters.com/SpectraForce1/'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

titles = [i.get_text() for i in soup.findAll('a', {'target':'_blank'})]
jobs = [re.sub(r'\s+', ' ', title) for title in titles]

links = [i.get('href') for i in soup.findAll('a', {'target':'_blank'})]

for i,j in enumerate(jobs):
    if 'GIS' in j:
        print links[i]

If you run this right now, it prints:

https://www.smartrecruiters.com/SpectraForce1/74985505-technical-specialist
https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist
https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician
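A plain substring test such as 'GIS' in j also matches titles where those letters sit inside a longer uppercase word. If that matters, a word-boundary regular expression is one option. The following is a rough Python 3 sketch, assuming the titles and links have already been extracted as pairs; matching_links and the sample pairs are hypothetical.

```python
import re

# \b anchors the match to word boundaries, so "REGISTRAR" does not count
GIS_WORD = re.compile(r"\bGIS\b")


def matching_links(pairs):
    """Yield hrefs from (title, href) pairs whose title has GIS as a whole word."""
    for title, href in pairs:
        # Collapse whitespace runs, as the re.sub call above does
        clean = re.sub(r"\s+", " ", title).strip()
        if GIS_WORD.search(clean):
            yield href


# Hypothetical (title, href) pairs; with BeautifulSoup they could come from
# ((a.get_text(), a.get("href")) for a in soup.find_all("a"))
pairs = [
    ("GIS\n   Specialist", "/jobs/1"),
    ("REGISTRAR", "/jobs/2"),
    ("Research Analyst/ GIS/ Engineering Technician", "/jobs/3"),
]
result = list(matching_links(pairs))
print(result)
```

Keeping each title paired with its link also avoids maintaining two parallel lists that are indexed together.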

Upvotes: 1
