Reputation: 10433
I'm scraping a page and I have to get the number of employees from this format:
<h5>Number of Employees</h5>
<p>
20
</p>
I need to get the number "20". The problem is that this number isn't always under the same header tag: sometimes the header is an "h4", and there are other "h5" headers on the page. So I need to find the header whose text is "Number of Employees" and then extract the number from the paragraph that follows it.
This is the link of the page
Upvotes: 0
Views: 196
Reputation: 18799
'normalize-space(//*[self::h4 or self::h5][contains(., "Number of Employees")]/following-sibling::p[1]/text())'
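To try the expression outside of a scraper, here is a minimal sketch that evaluates it with lxml. It assumes lxml is installed; the sample markup and the names SAMPLE, XPATH and employees are invented for illustration and are not taken from the real page.

```python
from lxml import html  # third-party package, assumed installed: pip install lxml

# Invented sample markup; the real page is not reproduced here.
SAMPLE = """
<div>
  <h5>Founded</h5>
  <p>1999</p>
  <h4>Number of Employees</h4>
  <p>
  20
  </p>
</div>
"""

# Same expression as above, split across lines for readability.
XPATH = (
    'normalize-space(//*[self::h4 or self::h5]'
    '[contains(., "Number of Employees")]'
    '/following-sibling::p[1]/text())'
)

tree = html.fromstring(SAMPLE)
# normalize-space() makes xpath() return a string; it is "" when nothing matches.
employees = tree.xpath(XPATH)
print(employees)  # prints: 20
```

Because normalize-space() collapses and trims whitespace, the newlines around the number inside the paragraph are dropped.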
Upvotes: 0
Reputation: 38
Well, the easiest way is to find the element that contains the "Number of Employees" text and then simply take the paragraph after it, assuming that the paragraph always follows right after the header.
Here's a quick and dirty piece of code that does this, and prints the number out:
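from bs4 import Tag  # soup is assumed to be an existing BeautifulSoup object
parent = soup.find("div", id="business-additional-info-text")
for child in parent.children:
    # Skip bare strings between tags; only a tag can be the header
    if isinstance(child, Tag) and "Number of Employees" in child.get_text():
        # The count is in the first <p> after the matching header
        print(child.find_next("p").get_text(strip=True))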
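For comparison, the same "find the header, then take the next paragraph" idea can be sketched with only the standard library's html.parser. This is a swapped-in technique, not what the BeautifulSoup code above does, and the class name EmployeeCountParser, the function employee_count and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class EmployeeCountParser(HTMLParser):
    """Hypothetical helper: keeps the text of the first <p> after a matching h4/h5."""

    def __init__(self, label="Number of Employees"):
        super().__init__()
        self.label = label
        self.in_header = False   # currently inside an <h4> or <h5>
        self.header_text = ""
        self.armed = False       # matching header seen, waiting for the next <p>
        self.in_target = False   # currently inside that <p>
        self.value = None        # stays None if no matching header is found

    def handle_starttag(self, tag, attrs):
        if tag in ("h4", "h5"):
            self.in_header = True
            self.header_text = ""
        elif tag == "p" and self.armed and self.value is None:
            self.in_target = True
            self.value = ""

    def handle_data(self, data):
        if self.in_header:
            self.header_text += data
        elif self.in_target:
            self.value += data

    def handle_endtag(self, tag):
        if tag in ("h4", "h5") and self.in_header:
            self.in_header = False
            if self.label in self.header_text:
                self.armed = True
        elif tag == "p" and self.in_target:
            self.in_target = False
            self.armed = False
            self.value = self.value.strip()


def employee_count(html_text):
    """Return the paragraph text after the matching header, or None."""
    parser = EmployeeCountParser()
    parser.feed(html_text)
    parser.close()
    return parser.value


# Invented sample markup with the header as an h4 this time.
sample_html = """
<h5>Founded</h5>
<p>1999</p>
<h4>Number of Employees</h4>
<p>
20
</p>
"""

print(employee_count(sample_html))  # prints: 20
```

Like the approach above, this takes the next paragraph in document order, so it relies on the paragraph following right after the header.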
Upvotes: 1