Lucas Magalhães

Reputation: 37

XPath results from Scrapy don't match what the HTML page shows

I'm having trouble scraping this search page:

https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg

I'm trying to extract the job name, company, location, and salary estimate of each result from the SimplyHired job search for Data Engineer in the US.

But when I run an XPath query for any of them with the Selector class, I get different results, in a different order, from what the page shows.

The outputs also don't line up with each other: for example, the index of a job in the job-name results is not the index of its location in the location results.

Here is my code:

from scrapy import Selector
import requests

response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').content

sel=Selector(text=response)

#job name
sel.xpath('//main[@id="job-list"]/div/article[contains(@class,"SerpJob")]/div/div[@class="jobposting-title-container"]/h2/a/text()').extract()

#company
sel.xpath('//main[@id="job-list"]/div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()

#location
sel.xpath('//main[@id="job-list"]//div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()

#salary estimates
sel.xpath('//main[@id="job-list"]//div/article/div/div[@class="SerpJob-metaInfo"]//div[@class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()

Upvotes: 0

Views: 307

Answers (1)

AaronS

Reputation: 2335

I'm not quite sure whether you're trying to use Scrapy or requests. It looks like you want to fetch the page with requests and query it with Scrapy's XPath selectors.

For websites like this, it's best to treat each individual job advert as a 'card'. Select all the cards first, then loop over them and run the XPath selectors you need against each card in turn.

Code Example

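# `sel` is the Selector built from the page HTML in the question.
cards = sel.xpath('//div[@class="SerpJob-jobCard card"]')
for card in cards:
    # Each relative query returns this card's value, or None if it is missing.
    title = card.xpath('.//a[@class="card-link"]/text()').get()
    company = card.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
    salary = card.xpath('.//span[@class="jobposting-salary"]/text()').get()
    location = card.xpath('.//span[@class="jobposting-location"]/text()').get()
    print(title, company, salary, location)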

Explanation

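You want to search each card with relative XPath selectors. The leading .// restricts the search to the HTML inside the current card, so the title, company, salary, and location you get back always belong to the same job. Separate page-wide queries drift out of alignment as soon as one card lacks a field, which is why your indexes don't match.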

Prefer get() and getall() over extract(). get() returns the first match as a string, or None when nothing matches, which is what we want when looping over each card. extract() is ambiguous: called on the result of an xpath() query it returns a list of all matches, even when there is only one, while called on a single selector it returns a string. If you want multiple values, use getall(); it is explicit and always returns a list.

Additional Information

If you find you're not getting the correct data in the right format, always check whether the content is being added to the page by JavaScript. Turn off JavaScript in your browser and refresh the page; whatever is still visible is what requests will receive. On this particular site, none of the data you need is loaded by JavaScript, which makes it much easier to scrape.

Upvotes: 1
