radi0

Reputation: 23

Extracting HTML tag content with xpath from a specific website

I am trying to use lxml to extract the contents of specific tags on a webpage, namely a page on Indeed.com.

Example page: link

I am trying to extract the company name and position name. Chrome shows that the company name is located at

"//*[@id='job-content']/tbody/tr/td[1]/div/span[1]"

and the position name is located at

"//*[@id='job-content']/tbody/tr/td[1]/div/b/font"

This bit of code tries to extract those values from a locally saved and parsed copy of the page:

import lxml.html as h

xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)

However, both print calls output empty lists, meaning nothing was extracted!

What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.

I would really appreciate any help with getting those two values extracted.

Upvotes: 2

Views: 104

Answers (2)

wp78de

Reputation: 18980

Try it like this:

company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
print(company)   # ['The Habitat Company']
print(position)  # ['Janitor-A (Scattered Sites)']

Once we have the //div[@data-tn-component='jobHeader'] path things become pretty straightforward:

  1. Select the text of the child span with /span[@class='company']/text() to get the company name.
  2. /b[@class='jobtitle']//text() is a bit more convoluted because the job title is nested inside a font tag, but selecting all descendant text with //text() returns the position.

    An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
    xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
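
To make the difference between /text(), //text() and text_content() concrete, here is a minimal, self-contained sketch. The SAMPLE markup is made up for illustration: it only mirrors the attribute names used in the XPath expressions above and is not the actual Indeed page. The names root and header are likewise just illustrative.

```python
import lxml.html as h

# Hypothetical markup, modeled on the attribute names used in the XPath above.
SAMPLE = """
<html><body>
<div data-tn-component="jobHeader">
  <b class="jobtitle"><font size="+1">Janitor-A (Scattered Sites)</font></b>
  <span class="company">The Habitat Company</span>
</div>
</body></html>
"""

root = h.fromstring(SAMPLE)
header = "//div[@data-tn-component='jobHeader']"

# The company name is a direct text child of the span, so /text() is enough.
company = root.xpath(header + "/span[@class='company']/text()")

# /text() only returns direct text children, so it misses text nested in <font>.
direct = root.xpath(header + "/b[@class='jobtitle']/text()")

# //text() collects the text nodes of every descendant.
nested = root.xpath(header + "/b[@class='jobtitle']//text()")

# text_content() joins all descendant text into a single string.
joined = root.xpath(header + "/b[@class='jobtitle']")[0].text_content()

print(company)
print(direct)
print(nested)
print(joined)
```

On this sample, direct comes back empty while nested and joined both contain the job title, which is why the position needs //text() or text_content() rather than a plain /text().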

Upvotes: 1

dennlinger

Reputation: 11490

Despite your assumption, it seems that the content on the page is loaded dynamically and is therefore not present in the HTML when the page is first loaded.
This means you can't access the elements from your downloaded HTML file. If you do not believe me, search for job-content in the actual file on your computer; it will contain only placeholders and descriptors.
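
A quick way to run that check is sketched below. The helper name contains_marker is hypothetical, and the sketch assumes the saved file can be read as UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
from pathlib import Path


def contains_marker(html_path, marker="job-content"):
    """Return True if the raw text of the saved HTML file contains the marker."""
    # errors="replace" keeps the check from failing on unexpected byte sequences.
    text = Path(html_path).read_text(encoding="utf-8", errors="replace")
    return marker in text
```

Calling contains_marker("Temp/IndeedPosition.html") on the file from the question would show whether the job-content id appears in the saved HTML at all. If it returns False, no XPath expression can find that element in the file.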

It seems you would have to use a tool like Selenium to perform this task. Again, I want to stress that whatever you are doing automatically is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.

Upvotes: 0
