Reputation: 49
I am trying to scrape a web page to list the jobs posted at this URL: https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad
Refer to the image for details of the web-page inspection.
The following is observed when inspecting the web page:
Each listed job is in an HTML li element with class="jobs-list-item". The parent div within that li carries the following attributes and data:
data-ph-at-job-title-text="Software Engineer II", data-ph-at-job-category-text="Engineering", data-ph-at-job-post-date-text="2018-03-19T16:33:00".
The first child div within the parent div, with class="information", contains a link with href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II"
My requirement is to extract the information above (job title, category, post date and job URL) for each job.
I have tried the following Python code to scrape the web page, but I am unable to extract the required information.
import requests
from bs4 import BeautifulSoup

def ms_jobs():
    url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
    resp = requests.get(url)
    if resp.status_code == 200:
        print("Successfully opened the web page")
        soup = BeautifulSoup(resp.text, 'html.parser')
        print(soup)
    else:
        print("Error")

ms_jobs()
Upvotes: 1
Views: 1245
Reputation: 740
If you want to do this via requests, you need to reverse engineer the site. Open the dev tools in Chrome, select the Network tab and fill out the form.
This will show you how the site loads the data. If you dig into it, you'll see that it grabs the data by doing a POST to this endpoint: https://careers.microsoft.com/widgets. It also shows you the payload that the site sends. The site uses cookies, so all you have to do is create a session that keeps the cookie, request a page to obtain one, and copy/paste the payload.
This way you'll be able to extract the same JSON data that the JavaScript fetches to populate the site dynamically.
Below is a working example of what that would look like. All that is left is to parse the JSON as you see fit.
import requests
from pprint import pprint
# create a session to grab a cookie from the site
session = requests.Session()
r = session.get("https://careers.microsoft.com/us/en/")
# these params are the ones that the dev tools show that site sets when using the website form
payload = {
    "lang": "en_us",
    "deviceType": "desktop",
    "country": "us",
    "ddoKey": "refineSearch",
    "sortBy": "",
    "subsearch": "",
    "from": 0,
    "jobs": "true",
    "counts": "true",
    "all_fields": ["country", "state", "city", "category", "employmentType", "requisitionRoleType", "educationLevel"],
    "pageName": "search-results",
    "size": 20,
    "keywords": "",
    "global": "true",
    "selected_fields": {"city": ["Hyderabad"], "country": ["India"]},
    "sort": "null",
    "locationData": {}
}
# this is the endpoint the site uses to fetch json
url = "https://careers.microsoft.com/widgets"
r = session.post(url, json=payload)
data = r.json()
job_list = data['refineSearch']['data']['jobs']
# job_list will hold 20 jobs (you can set the "size" parameter in the payload to a higher number if you please; I tested 100, which returned 100 jobs)
job = job_list[0]
pprint(job)
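As a rough sketch of the remaining parsing step, the snippet below flattens each job dict into just the fields the question asks for, and builds one payload per page. The key names in FIELD_KEYS ("title", "category", "postedDate", "jobId") are assumptions, since the actual keys are not shown here; check the output of pprint(job) and adjust them. The helper names summarize_job and page_payloads are made up for illustration, and page_payloads assumes that "from" is a zero-based offset into the result set, which the payload suggests but which is not confirmed.

```python
import copy

# Hypothetical key names: the real field names in the JSON response are not
# shown in this answer, so inspect pprint(job) and adjust this mapping.
FIELD_KEYS = {
    "title": "title",
    "category": "category",
    "post_date": "postedDate",
    "job_id": "jobId",
}


def summarize_job(job, field_keys=FIELD_KEYS):
    """Return a flat dict with only the wanted fields; missing keys become None."""
    return {name: job.get(key) for name, key in field_keys.items()}


def page_payloads(base_payload, total, size=20):
    """Yield one payload copy per page, assuming "from" is a zero-based offset."""
    for offset in range(0, total, size):
        page = copy.deepcopy(base_payload)  # leave the caller's payload untouched
        page["from"] = offset
        page["size"] = size
        yield page


if __name__ == "__main__":
    # Made-up sample data standing in for entries of job_list.
    sample_jobs = [
        {"title": "Software Engineer II", "category": "Engineering",
         "postedDate": "2018-03-19T16:33:00", "jobId": "406138"},
        {"title": "Program Manager"},
    ]
    for job in sample_jobs:
        print(summarize_job(job))
```

With the session from the example above, each payload yielded by page_payloads(payload, total=...) could be sent with session.post(url, json=page) and the resulting job lists concatenated. The total would have to come from the response or from your own upper bound.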
Cheers.
Upvotes: 1