Reputation: 23
I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[@id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[@id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, both print calls output empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated, since the page loads normally with JavaScript disabled.
I would really appreciate any help with getting those two values extracted.
Upvotes: 2
Views: 104
Reputation: 18980
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
Printing company and position then gives:
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
/span[@class='company']/text() to get the company name.
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag. But we can just select any descendant text using //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
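To try both approaches without the saved page, here is a minimal sketch that runs the same expressions against a small inline snippet. The markup in SAMPLE is an assumption, reduced to just what the XPath expressions above expect; the real page is larger and may well differ.

```python
import lxml.html as h

# Hypothetical markup, modelled only on the XPath expressions above.
# It is NOT a copy of the real page.
SAMPLE = """
<html><body>
<table id="job-content"><tr><td>
<div data-tn-component="jobHeader">
<b class="jobtitle"><font size="+1">Janitor-A (Scattered Sites)</font></b>
<span class="company">The Habitat Company</span>
</div>
</td></tr></table>
</body></html>
"""

root = h.fromstring(SAMPLE)
header = "//div[@data-tn-component='jobHeader']"

# Direct text child of the company span.
company = root.xpath(header + "/span[@class='company']/text()")

# Any descendant text of the b element (here: the text inside font).
position = root.xpath(header + "/b[@class='jobtitle']//text()")

# Alternative: select the element itself and let lxml collect its text.
title_node = root.xpath(header + "/b[@class='jobtitle']")[0]
position_text = title_node.text_content()

print(company)        # ['The Habitat Company']
print(position)       # ['Janitor-A (Scattered Sites)']
print(position_text)  # Janitor-A (Scattered Sites)
```

Note that xpath() returns a list in both cases, while text_content() returns a single string, which is often more convenient when the title is split across nested tags.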
Upvotes: 1
Reputation: 11490
Despite your assumption, the content on the page seems to be loaded dynamically, so it is not present in the HTML as initially delivered.
This means you can't access the elements from your downloaded HTML file. If you do not believe me, search for job-content in the actual file on your computer, which will only contain placeholders and descriptors.
It seems you would have to use a tool like Selenium to perform this task. I also want to stress that whatever you are doing automatically is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.
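The check on the saved file can be sketched in a few lines. This is only a sketch: the helper name file_mentions is made up for illustration, and the path is the one used in the question.

```python
from pathlib import Path


def file_mentions(path, marker):
    """Return True if the saved file contains the marker string anywhere."""
    # errors="replace" avoids a crash if the page is not valid UTF-8.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return marker in text


if __name__ == "__main__":
    saved = Path("Temp/IndeedPosition.html")  # path taken from the question
    if saved.exists():
        # False here would mean the element is not in the static HTML at all.
        print(file_mentions(saved, "job-content"))
```

If this prints False, no XPath expression can find the element in that file, and the content has to come from a rendered page instead.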
Upvotes: 0