Extracting HTML tag content with xpath from a specific website

Question

I am trying to use lxml to extract the contents of specific tags on a webpage, namely a job posting on Indeed.com.

Example page: link

I am trying to extract the company name and position name. Chrome shows that the company name is located at

"//*[@id='job-content']/tbody/tr/td[1]/div/span[1]"

and the position name is located at

"//*[@id='job-content']/tbody/tr/td[1]/div/b/font"

This bit of code tries to extract those values from a locally saved and parsed copy of the page:

import lxml.html as h

xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)

However, both print calls output empty lists, meaning nothing was extracted!

What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.

I would really appreciate any help with getting those two values extracted.

wp78de · Accepted Answer

Try it like this:

company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
print(company)   # ['The Habitat Company']
print(position)  # ['Janitor-A (Scattered Sites)']

Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:

- Select the text of the child span with /span[@class='company']/text() to get the company name.
- /b[@class='jobtitle']//text() is a bit more convoluted because the job title is embedded in a font tag, but we can just select any descendant text with //text() to get the position.
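
To see why //text() is needed for the job title, here is a minimal sketch run against an inline HTML string. The markup in SAMPLE is hypothetical: it is only modeled on the selectors above, and the real page may be structured differently.

```python
import lxml.html as h

# Hypothetical markup modeled on the selectors in the answer;
# the real Indeed page may differ.
SAMPLE = """
<html><body>
<div data-tn-component="jobHeader">
  <b class="jobtitle"><font size="+1">Janitor-A (Scattered Sites)</font></b>
  <span class="company">The Habitat Company</span>
</div>
</body></html>
"""

root = h.fromstring(SAMPLE)
header = "//div[@data-tn-component='jobHeader']"

# The company name is a direct text child of the span, so /text() is enough.
company = root.xpath(header + "/span[@class='company']/text()")

# /text() on the b element finds nothing: its text lives inside the nested font tag.
direct = root.xpath(header + "/b[@class='jobtitle']/text()")

# //text() selects text nodes at any depth below the b element.
position = root.xpath(header + "/b[@class='jobtitle']//text()")

print(company)   # ['The Habitat Company']
print(direct)    # []
print(position)  # ['Janitor-A (Scattered Sites)']
```

Note that //text() returns one list item per text node, so a title split across several nested tags would come back as several strings.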

An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
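
Indexing with [0] raises an IndexError when the XPath matches nothing, so it may be worth guarding the lookup. The following is a sketch only: first_text is a hypothetical helper name, and the SAMPLE markup is an assumption modeled on the selectors above rather than the actual page.

```python
import lxml.html as h

def first_text(root, xpath_expr):
    """Return the stripped text_content() of the first match, or None if nothing matches."""
    nodes = root.xpath(xpath_expr)
    if not nodes:
        return None
    # text_content() concatenates the text of the element and all its descendants.
    return nodes[0].text_content().strip()

# Hypothetical markup; the real page may differ.
SAMPLE = """
<div data-tn-component="jobHeader">
  <b class="jobtitle"><font size="+1">Janitor-A (Scattered Sites)</font></b>
  <span class="company">The Habitat Company</span>
</div>
"""

root = h.fromstring(SAMPLE)
title = first_text(root, "//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")
missing = first_text(root, "//div[@id='does-not-exist']")

print(title)    # Janitor-A (Scattered Sites)
print(missing)  # None
```

Returning None instead of raising makes it easier to tell "selector did not match" apart from "element was empty" when debugging a saved copy of a page.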
