Dan

Reputation: 33

Web scraping data from an Ajax page

I am attempting to scrape job titles from here.

I am learning web scraping with Python, but I am stuck on scraping an Ajax page like this one. Using the code below, I can get the same response data that the browser's developer tools show for the first page. How do I extract the job titles from this data?

from bs4 import BeautifulSoup
import requests
import json

s = requests.Session()
headers={"User-Agent":"Mozilla/5.0"}
r=s.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en',headers=headers)
html = r.text
soup = BeautifulSoup(html, 'lxml')
print(soup)

###how to extract job titles from soup###

Would really appreciate any help on this.

Unfortunately, I am currently limited to using requests or another popular Python library. Thanks in advance.

Upvotes: 1

Views: 254

Answers (3)

Andrej Kesely

Reputation: 195418

Try:

import re
import json
import requests

url = "https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en"

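# The job data is embedded in the page's JavaScript as one flat array literal.
data = re.search(r"listRequisition', (\[.*?\])\);", requests.get(url).text)
# Swap single quotes for double quotes so the array parses as JSON.
data = data.group(1).replace("'", '"')
data = json.loads(data)
# Each job occupies 40 consecutive items; the title is the item at index 5.
for i in range(len(data) // 40):
    row = data[i * 40 : (i + 1) * 40]
    print(row[5])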

Prints:

Technician, I %26 E (Coyanosa, TX)
Engineer, Senior Project
Engineer, Project
Mechanic, Truck (Monahans, TX)
Technician, Pipeline (Bryan/College Station)
Technician, Measurement (Farmington, NM)
Assistant, Field Administrative (Carlsbad, NM)
Technician, Pipeline (Greensburg, PA)
Human Resources Business Partner
Engineer, Senior Measurement
Accountant (Mont Belvieu)
Specialist, Senior Accounts Payable
Technician, Pipeline Trainee( Cape Girardeau)
Specialist, EAM Inventory
Welder - Class B
Specialist, Senior NGL Accounts Payable
Technician, Pipeline (Hobbs, NM)
Auditor, IT
Accountant, Intermediate
Accountant
Operator, Plant (Sonora, TX)
Technician, Pipeline (Carlsbad, NM)
Specialist, Maintenance (Lebanon, OH)
Technician, Pipeline Trainee 
Specialist, Senior Systems
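
The `%26` in the first title is a URL-encoded `&`. As a minimal sketch (not part of the answer above), the fixed-width slicing can be wrapped in a helper that also decodes each title with `urllib.parse.unquote`. The function name `titles_from_flat_list` and the small `sample` list are hypothetical stand-ins; on the real page the rows are 40 items wide with the title at index 5, as in the code above.

```python
from urllib.parse import unquote


def titles_from_flat_list(flat, row_width=40, title_index=5):
    """Split a flat list into fixed-width rows and return one decoded field per row."""
    titles = []
    # Step through the list one full row at a time; a trailing partial row is ignored.
    for start in range(0, len(flat) - row_width + 1, row_width):
        row = flat[start:start + row_width]
        # unquote() turns percent-escapes such as %26 back into characters.
        titles.append(unquote(row[title_index]))
    return titles


# Hypothetical stand-in for the real data: rows of 3 items, title at index 1.
sample = [
    "1", "Technician, I %26 E (Coyanosa, TX)", "x",
    "2", "Engineer, Project", "y",
]
print(titles_from_flat_list(sample, row_width=3, title_index=1))
```

With the parsed `data` list from the answer, the call would presumably be `titles_from_flat_list(data)`, which avoids hard-coding the number of jobs per page.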

Upvotes: 0

Matteo Bianchi

Reputation: 442

This site is dynamic (it changes its data with JavaScript), so one option is to use Selenium. You can run it in headless mode, so it works much like sending requests:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument('--headless')

# Selenium 4 style: the driver path goes in Service, the options in `options=`
driver = webdriver.Chrome(service=Service(r'yourpath\chromedriver.exe'), options=options)

driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')

html = driver.page_source
driver.quit()  # close the browser once the page source has been captured
soup = BeautifulSoup(html, 'lxml')
print(soup)

Upvotes: 2

MendelG

Reputation: 20018

The data is within a <script> tag. You can use the re module to find the job titles.

import re
import requests

headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(
    "https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en",
    headers=headers,  # actually send the User-Agent defined above
)
job_titles = re.findall(r"Add this position to the job cart: (.*?)'", response.text)
print(len(job_titles))
print(job_titles)

Output:

25
['Technician, I %26 E (Coyanosa, TX)', 'Engineer, Senior Project', 'Engineer, Project', 'Mechanic, Truck (Monahans, TX)', 'Technician, Pipeline (Bryan/College Station)', 'Technician, Measurement (Farmington, NM)', 'Assistant, Field Administrative (Carlsbad, NM)', 'Technician, Pipeline (Greensburg, PA)', 'Human Resources Business Partner', 'Engineer, Senior Measurement', 'Accountant (Mont Belvieu)', 'Specialist, Senior Accounts Payable', 'Technician, Pipeline Trainee( Cape Girardeau)', 'Specialist, EAM Inventory', 'Welder - Class B', 'Specialist, Senior NGL Accounts Payable', 'Technician, Pipeline (Hobbs, NM)', 'Auditor, IT', 'Accountant, Intermediate', 'Accountant', 'Operator, Plant (Sonora, TX)', 'Technician, Pipeline (Carlsbad, NM)', 'Specialist, Maintenance (Lebanon, OH)', 'Technician, Pipeline Trainee ', 'Specialist, Senior Systems']
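
To see what the pattern does without hitting the site, here is a small offline sketch. The `sample_text` string is a made-up fragment that only imitates the shape of the script content, and `extract_titles` is a hypothetical helper name. It also strips surrounding whitespace, since one title in the output above ends with a trailing space.

```python
import re

# Non-greedy capture: everything after the label up to the next single quote.
TITLE_RE = re.compile(r"Add this position to the job cart: (.*?)'")


def extract_titles(page_text):
    """Return every captured job title, with surrounding whitespace removed."""
    return [title.strip() for title in TITLE_RE.findall(page_text)]


# Made-up fragment imitating the quoted strings inside the <script> tag.
sample_text = (
    "['Add this position to the job cart: Engineer, Project','x',"
    "'Add this position to the job cart: Technician, Pipeline Trainee ','y']"
)
print(extract_titles(sample_text))
```

One limitation of this approach: because the capture stops at the first single quote, a title that itself contains an apostrophe would be cut short.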

Upvotes: 0
