Reputation: 703
I would like to search jobs in my field in batch mode with the use of BeautifulSoup
. I would have a list of urls, all consisting of employer career pages. If the search finds the keyword GIS
in the job title, I want it to return the link to the job posting.
I'll give some case scenarios:
The first company site required a keyword search. This page is the result:
https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20
I would like it to return the following:
https://jobs-challp.icims.com/jobs/2432/gis-specialist/job
https://jobs-challp.icims.com/jobs/2369/gis-specialist/job
The second site did not require a keyword search:
https://www.smartrecruiters.com/SpectraForce1/
I would like it to return the following:
https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist
https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician
This is as far as I can get:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.smartrecruiters.com/SpectraForce1/'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
text = soup.get_text()
if 'GIS ' in text:
print 'Job Found!'
There are two problems:
1.) This of course returns a confirmation that a job is found, but does not return the link to the job itself
2.) The two relevant positions were not found using this method for the first company site. I checked this by scanning the output of soup.get_text()
, and saw that it did not contain job titles in the returned text.
Any help or additional suggestions would be appreciated.
Thanks!
Upvotes: 0
Views: 590
Reputation:
Here's my attempt, but it's pretty much the same as above :
from bs4 import BeautifulSoup
from urllib2 import urlopen
def work(url):
soup = BeautifulSoup(urlopen(url).read())
for i in soup.findAll("a", text=True):
if "GIS" in i.text:
print "Found link "+i["href"].replace("?in_iframe=1", "")
urls = ["https://jobs-challp.icims.com/jobs/search?pr=0&searchKeyword=gis&searchRadius=20&in_iframe=1", "https://www.smartrecruiters.com/SpectraForce1/"]
for i in urls:
work(i)
It defines a function "work()" that does the actual work, getting the page from the remote server; using urlopen()
since it looked like you wanted to use urllib2
but I suggest you use Python-Requests; then it finds all a
elements (links) using findAll()
, and for each link it checks if "GIS" is in the link's text, and if it is then it prints the link's href
attribute.
Then it defines the list of URLs (just 2 URLs in this case) using a list comprehension, and then it runs the work()
function for each URL in the list and passes it as an argument to the function.
Upvotes: 1
Reputation: 7821
Here's a go!
This code will find all links with the string containing 'GIS'. I needed to add &in_iframe=1
to make the first link work.
import urllib2
from bs4 import BeautifulSoup
urls = ['https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20&in_iframe=1',
'https://www.smartrecruiters.com/SpectraForce1/']
for url in urls:
soup = BeautifulSoup(urllib2.urlopen(url))
print 'Scraping {}'.format(url)
for link in soup.find_all('a'):
if 'GIS' in link.text:
print '--> TEXT: ' + link.text.strip()
print '--> URL: ' + link['href']
print ''
Output:
Scraping https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20&in_iframe=1
--> TEXT: GIS Specialist
--> URL: https://jobs-challp.icims.com/jobs/2432/gis-specialist/job?in_iframe=1
--> TEXT: GIS Specialist
--> URL: https://jobs-challp.icims.com/jobs/2369/gis-specialist/job?in_iframe=1
Scraping https://www.smartrecruiters.com/SpectraForce1/
--> TEXT: Technical Specialist/ Research Analyst/ GIS/ Engineering Technician
--> URL: https://www.smartrecruiters.com/SpectraForce1/74985505-technical-specialist
--> TEXT: GIS Specialist
--> URL: https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist
--> TEXT: GIS Technician
--> URL: https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician
Upvotes: 1
Reputation: 8043
here is one way:
from bs4 import BeautifulSoup
import urllib2
import re
url = 'https://www.smartrecruiters.com/SpectraForce1/'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
titles = [i.get_text() for i in soup.findAll('a', {'target':'_blank'})]
jobs = [re.sub('\s+',' ',title) for title in titles]
links = [i.get('href') for i in soup.findAll('a', {'target':'_blank'})]
for i,j in enumerate(jobs):
if 'GIS' in j:
print links[i]
if you run this right now it will print:
https://www.smartrecruiters.com/SpectraForce1/74985505-technical-specialist
https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist
https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician
Upvotes: 1