Reputation: 33
I am attempting to scrape job titles from here.
I am learning web scraping with Python, but I am stuck on scraping an Ajax page like this one. Using the code below, I can fetch the same response data that the browser's developer tools show for the first page. How can I extract the job titles from this data?
from bs4 import BeautifulSoup
import requests
import json
s = requests.Session()
headers={"User-Agent":"Mozilla/5.0"}
r=s.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en',headers=headers)
html = r.text
soup = BeautifulSoup(html, 'lxml')
print(soup)
###how to extract job titles from soup###
Would really appreciate any help on this.
Unfortunately, I am currently limited to using only requests or another popular Python library. Thanks in advance.
Upvotes: 1
Views: 254
Reputation: 195418
Try:
import re
import json
import requests
url = "https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en"
data = re.search(r"listRequisition', (\[.*?\])\);", requests.get(url).text)
data = data.group(1).replace("'", '"')
data = json.loads(data)
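# The list is flat: each of the 25 jobs on the page takes 40 consecutive fields,
# and the job title is the field at index 5 of each group.
for i in range(25):
    row = data[i * 40 : (i + 1) * 40]
    print(row[5])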
Prints:
Technician, I %26 E (Coyanosa, TX)
Engineer, Senior Project
Engineer, Project
Mechanic, Truck (Monahans, TX)
Technician, Pipeline (Bryan/College Station)
Technician, Measurement (Farmington, NM)
Assistant, Field Administrative (Carlsbad, NM)
Technician, Pipeline (Greensburg, PA)
Human Resources Business Partner
Engineer, Senior Measurement
Accountant (Mont Belvieu)
Specialist, Senior Accounts Payable
Technician, Pipeline Trainee( Cape Girardeau)
Specialist, EAM Inventory
Welder - Class B
Specialist, Senior NGL Accounts Payable
Technician, Pipeline (Hobbs, NM)
Auditor, IT
Accountant, Intermediate
Accountant
Operator, Plant (Sonora, TX)
Technician, Pipeline (Carlsbad, NM)
Specialist, Maintenance (Lebanon, OH)
Technician, Pipeline Trainee
Specialist, Senior Systems
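The fixed `range(25)` and the slice width of 40 in the loop above can be wrapped in a small helper that derives the number of rows from the list length. This is only a sketch: the function name `titles_from_flat`, its parameters, and the `sample` list are hypothetical, and the layout of 40 fields per job with the title at index 5 is taken from the loop above rather than from any documented format.

```python
def titles_from_flat(flat, fields_per_row=40, title_index=5):
    """Split a flat field list into fixed-width rows and return one field per row."""
    titles = []
    # Step through the list one full row at a time; a trailing partial row is skipped.
    for start in range(0, len(flat) - fields_per_row + 1, fields_per_row):
        row = flat[start : start + fields_per_row]
        titles.append(row[title_index])
    return titles


# Hypothetical stand-in for the parsed list: two rows of 40 fields each.
sample = ["x"] * 80
sample[5] = "Engineer, Senior Project"
sample[45] = "Engineer, Project"
print(titles_from_flat(sample))
```

Calling `titles_from_flat(data)` on the list parsed above would then avoid hard-coding the page size, assuming every job really does occupy exactly 40 fields.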
Upvotes: 0
Reputation: 442
This site is dynamic (it changes its data with JavaScript), so you can use Selenium to get the rendered page. You can run it in headless mode, so it behaves much like sending requests:
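from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
# Selenium 4 takes the driver path through a Service object and the options through `options=`.
driver = webdriver.Chrome(service=Service(executable_path=r'yourpath\chromedriver.exe'), options=options)
driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')
html = driver.page_source  # HTML after JavaScript has run
driver.quit()
soup = BeautifulSoup(html, 'lxml')
print(soup)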
Upvotes: 2
Reputation: 20018
The data is within a <script> tag. You can use the re module to find the job titles.
import re
import requests
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(
    "https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en",
    headers=headers,  # pass the User-Agent defined above
)
job_titles = re.findall(r"Add this position to the job cart: (.*?)'", response.text)
print(len(job_titles))
print(job_titles)
Output:
25
['Technician, I %26 E (Coyanosa, TX)', 'Engineer, Senior Project', 'Engineer, Project', 'Mechanic, Truck (Monahans, TX)', 'Technician, Pipeline (Bryan/College Station)', 'Technician, Measurement (Farmington, NM)', 'Assistant, Field Administrative (Carlsbad, NM)', 'Technician, Pipeline (Greensburg, PA)', 'Human Resources Business Partner', 'Engineer, Senior Measurement', 'Accountant (Mont Belvieu)', 'Specialist, Senior Accounts Payable', 'Technician, Pipeline Trainee( Cape Girardeau)', 'Specialist, EAM Inventory', 'Welder - Class B', 'Specialist, Senior NGL Accounts Payable', 'Technician, Pipeline (Hobbs, NM)', 'Auditor, IT', 'Accountant, Intermediate', 'Accountant', 'Operator, Plant (Sonora, TX)', 'Technician, Pipeline (Carlsbad, NM)', 'Specialist, Maintenance (Lebanon, OH)', 'Technician, Pipeline Trainee ', 'Specialist, Senior Systems']
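The output above still contains a percent-encoded ampersand (`%26`) and one title with a trailing space. A minimal sketch of a cleanup step follows, reusing the same regular expression; the function name `extract_job_titles` and the `sample` string are hypothetical, and the sample only imitates the shape the pattern expects, not the page's real markup.

```python
import re
from urllib.parse import unquote

# Same pattern as in the answer above: capture everything up to the next single quote.
TITLE_PATTERN = re.compile(r"Add this position to the job cart: (.*?)'")


def extract_job_titles(page_text):
    """Return percent-decoded, whitespace-stripped titles found in page_text."""
    return [unquote(title).strip() for title in TITLE_PATTERN.findall(page_text)]


# Hypothetical fragment imitating the text the pattern matches against.
sample = (
    "['Add this position to the job cart: Technician, I %26 E (Coyanosa, TX)',"
    "'Add this position to the job cart: Technician, Pipeline Trainee ']"
)
print(extract_job_titles(sample))
```

With the real page, `extract_job_titles(response.text)` would be the equivalent call. Decoding with `unquote` assumes the titles are percent-encoded throughout, which the `%26` in the output suggests but does not prove.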
Upvotes: 0