Reputation: 37
I'm having some issues crawling this website's search:
I'm trying to extract these elements from the SimplyHired job search for Data Engineer in the US:
But when I apply an XPath locator to any of them using the Selector module, I get different results, and in a different order.
The outputs also don't line up with each other (for example, the index of a job name in the job-name XPath results is not the index of its location in the location XPath results).
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').content
sel=Selector(text=response)
#job name
sel.xpath('//main[@id="job-list"]/div/article[contains(@class,"SerpJob")]/div/div[@class="jobposting-title-container"]/h2/a/text()').extract()
#company
sel.xpath('//main[@id="job-list"]/div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
#location
sel.xpath('//main[@id="job-list"]//div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
#salary estimates
sel.xpath('//main[@id="job-list"]//div/article/div/div[@class="SerpJob-metaInfo"]//div[@class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
Upvotes: 0
Views: 307
Reputation: 2335
I'm not quite sure whether you're trying to use Scrapy or requests. It looks like you want to use requests, but with XPath selectors.
For websites like this, it's best to treat each individual job advert as a 'card'. Select the cards first, then loop over them and apply the XPath selectors you need to each card in turn.
card = sel.xpath('//div[@class="SerpJob-jobCard card"]')
for a in card:
    title = a.xpath('.//a[@class="card-link"]/text()').get()
    company = a.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
    salary = a.xpath('.//span[@class="jobposting-salary"]/text()').get()
    location = a.xpath('.//span[@class="jobposting-location"]/text()').get()
You want to search each card with relative XPath selectors. The leading .// restricts the search to the chunk of HTML beneath the current card, so every field you read belongs to the same job advert.
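To see why the per-card loop fixes the mismatched indexes, here is a minimal sketch. It uses the standard-library xml.etree.ElementTree instead of Scrapy's Selector so it runs without extra installs, and the markup, class names and values are invented for illustration; they are not SimplyHired's real HTML.

```python
import xml.etree.ElementTree as ET

# Invented markup (an assumption): the second card deliberately has no salary.
PAGE = """
<main id="job-list">
  <div class="card"><a class="card-link">Data Engineer</a>
    <span class="company">Acme</span><span class="salary">$100k</span></div>
  <div class="card"><a class="card-link">ETL Developer</a>
    <span class="company">Globex</span></div>
  <div class="card"><a class="card-link">Analytics Engineer</a>
    <span class="company">Initech</span><span class="salary">$120k</span></div>
</main>
""".strip()

root = ET.fromstring(PAGE)

# Page-wide queries: each list is built independently, so a card with a
# missing field shifts every later index in that list.
titles = [a.text for a in root.findall(".//a[@class='card-link']")]
salaries = [s.text for s in root.findall(".//span[@class='salary']")]


def text_or_none(card, path):
    """Return the text of the first match under this card, or None."""
    node = card.find(path)
    return node.text if node is not None else None


# Per-card queries: the relative path only looks beneath the current card.
rows = []
for card in root.findall(".//div[@class='card']"):
    rows.append({
        "title": text_or_none(card, ".//a[@class='card-link']"),
        "company": text_or_none(card, ".//span[@class='company']"),
        "salary": text_or_none(card, ".//span[@class='salary']"),
    })

print(len(titles), len(salaries))
for row in rows:
    print(row)
```

The page-wide lists have different lengths (three titles, two salaries), so salaries[1] actually belongs to the third job. The per-card rows stay aligned, with the missing salary recorded as None.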
Use get() rather than extract(). get() returns a single value: the first match as a string, or None if nothing matches, which is what we want when looping over each card. extract() on an XPath query result returns a list of every match, even when there is only one, which is often not what you want here. extract() is also ambiguous, because what it returns depends on the object it is called on. If you want multiple values, use getall(): it is explicit and always gives you a list.
If you find you're not getting the correct data in the right format, always check whether JavaScript is adding content to the page: turn off JavaScript in your browser and refresh the page. On this particular site, none of the data you need is loaded by JavaScript, which makes it much easier to scrape.
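The same check can be roughly approximated from code: look for text you can see in the browser inside the raw HTML that requests receives. This is only a sketch; the helper name missing_from_static_html and the sample HTML are hypothetical, and in practice raw_html would come from something like requests.get(url).text.

```python
def missing_from_static_html(html, expected_snippets):
    """Return the snippets that do not occur in the raw (pre-JavaScript) HTML."""
    return [snippet for snippet in expected_snippets if snippet not in html]


# Stand-in for the body of an HTTP response (invented for illustration).
raw_html = (
    '<article><a class="card-link">Data Engineer</a>'
    '<div id="salary"></div></article>'
)

# Text copied from the rendered page in the browser (also invented here).
missing = missing_from_static_html(raw_html, ["Data Engineer", "$100k"])
print(missing)
```

Anything left in missing is a candidate for content that is injected after the page loads, although it could also just be formatted differently in the markup, so treat it as a hint rather than proof.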
Upvotes: 1