Reputation: 110
I have an HTML page with the data below:
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td style="width:50%;padding-right:8px;" valign="top">
<h2 class="sectionTitle">Corporate Headquarters</h2>
<div itemprop="workLocation">6901 Professional Parkway East<br />Sarasota, Florida 34240<br /><br />United States<br /><br /></div><span class="detail">Phone</span>: <span itemprop="telephone">941-556-2601</span><br /><span class="detail">Fax</span>: <span itemprop="faxNumber">--</span>
<h2 class="sectionTitle">Board Members Memberships</h2>
<div>2011-Present</div><div><strong>Director</strong></div><div style="padding-bottom:15px;"><a href="../../stocks/snapshot/snapshot.asp?capId=11777224">TrustWave Holdings, Inc.</a></div><div>2018-Present</div><div><strong>President, CEO & Director</strong></div><div style="padding-bottom:15px;"><a href="../../stocks/snapshot/snapshot.asp?capId=22751">Roper Technologies, Inc.</a></div>
<h2 class="sectionTitle">Education</h2>
<div><strong>MBA</strong> </div><div style="padding-bottom:15px;" itemprop="alumniOf">Harvard Business School</div><div><strong>Unknown/Other Education</strong> </div><div style="padding-bottom:15px;" itemprop="alumniOf">Miami University</div><div><strong>Bachelor's Degree</strong> </div><div style="padding-bottom:15px;" itemprop="alumniOf">Miami University</div>
<h2 class="sectionTitle">Other Affiliations</h2>
<div><a itemprop="affiliation" href="../../stocks/snapshot/snapshot.asp?capId=424885">MedAssets, Inc.</a></div><div><a itemprop="affiliation" href="../../stocks/snapshot/snapshot.asp?capId=1131022">Harvard Business School</a></div><div><a itemprop="affiliation" href="../../stocks/snapshot/snapshot.asp?capId=4109057">Miami University</a></div><div><a itemprop="affiliation" href="../../stocks/snapshot/snapshot.asp?capId=6296385">MedAssets Net Revenue Systems, LLC</a></div><div><a itemprop="affiliation" href="../../stocks/snapshot/snapshot.asp?capId=11777224">TrustWave Holdings, Inc.</a></div><div><a itemprop="affiliation" href="../../stocks/snapshot/snapshot.asp?capId=138296355">Medassets Services LLC</a></div>
</td>
I'm trying to extract the information under "Board Members Memberships" as:
Director
TrustWave Holdings, Inc.
CEO & Director
Roper Technologies, Inc.
These elements do not have any class or id that would make extraction easy.
So far, all I'm able to do is:
soup.find('td',style="width:50%;padding-right:8px;").findAll("strong")
This gives me the following result, which also includes the entries from the Education section:
[<strong>Director</strong>,
<strong>President, CEO & Director</strong>,
<strong>MBA</strong>,
<strong>Unknown/Other Education</strong>,
<strong>Bachelor's Degree</strong>]
Can someone please show me how to get only the "Board Members Memberships" entries?
Upvotes: 2
Views: 5373
Reputation: 1402
You can use the navigation options that BeautifulSoup provides. A loop and a couple of conditional statements will get you what you are looking for.
Step 1: Iterate over all the <div> tags.
Step 2: Keep only the divs that contain a <strong> tag; these hold the position titles.
Step 3: From each position div, navigate to its next sibling <div>.
Step 4: Keep that sibling only if it contains an <a> tag, since the company names are wrapped in links.
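# soup is the BeautifulSoup object built from the page in the question
titles = soup.find_all('div')
for title in titles:
    if title.strong:
        company = title.find_next_sibling('div')
        # find_next_sibling returns None when there is no following div
        if company and company.a:
            person_title = title.text
            person_company = company.text
            print(person_company, person_title)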
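If you want to try this end to end, here is a minimal self-contained sketch of the same loop. The sample markup is an abbreviated copy of the HTML from the question (with the links replaced by "#"), and the function name title_company_pairs is only an illustrative choice, not something BeautifulSoup provides. It assumes the bs4 package is installed and uses the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's markup (assumption: same structure).
SAMPLE_HTML = """
<table><tr><td style="width:50%;padding-right:8px;" valign="top">
<h2 class="sectionTitle">Board Members Memberships</h2>
<div>2011-Present</div><div><strong>Director</strong></div>
<div style="padding-bottom:15px;"><a href="#">TrustWave Holdings, Inc.</a></div>
<div>2018-Present</div><div><strong>President, CEO &amp; Director</strong></div>
<div style="padding-bottom:15px;"><a href="#">Roper Technologies, Inc.</a></div>
<h2 class="sectionTitle">Education</h2>
<div><strong>MBA</strong> </div>
<div style="padding-bottom:15px;" itemprop="alumniOf">Harvard Business School</div>
</td></tr></table>
"""


def title_company_pairs(html):
    """Return (title, company) tuples for every title div followed by a linked company div."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for title in soup.find_all("div"):
        # Only divs that wrap a <strong> hold a position title.
        if title.strong is None:
            continue
        company = title.find_next_sibling("div")
        # Education entries have no <a>, so they are skipped here.
        if company is not None and company.a is not None:
            pairs.append((title.get_text(strip=True), company.get_text(strip=True)))
    return pairs


pairs = title_company_pairs(SAMPLE_HTML)
for title, company in pairs:
    print(title)
    print(company)
```

The Education rows drop out only because their second div has no link, so this relies on the page keeping that convention.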
Hope this helps! Cheers!
Upvotes: 3
Reputation: 2104
My Python skills are a bit rusty, so treat this as a rough sketch to get you on your way. Good luck!
result = ""
td_content = soup.find('td', style="width:50%;padding-right:8px;")
for header in td_content.find_all("h2"):
    if header.text == "Board Members Memberships":
        # walk the tags that follow this header and stop at the next <h2>
        for item in header.find_next_siblings():
            if item.name == "h2":
                break
            if item.strong:
                result += item.strong.get_text() + "\n"
            if item.a:
                result += item.a.get_text() + "\n"
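Here is one way the same idea might look as a runnable, self-contained function. The sample markup is an abbreviated copy of the question's HTML, and section_lines is a hypothetical helper name chosen for this sketch. It assumes bs4 is installed and that each heading's text matches exactly.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's markup (assumption: same structure).
SAMPLE_HTML = """
<table><tr><td style="width:50%;padding-right:8px;" valign="top">
<h2 class="sectionTitle">Board Members Memberships</h2>
<div>2011-Present</div><div><strong>Director</strong></div>
<div style="padding-bottom:15px;"><a href="#">TrustWave Holdings, Inc.</a></div>
<div>2018-Present</div><div><strong>President, CEO &amp; Director</strong></div>
<div style="padding-bottom:15px;"><a href="#">Roper Technologies, Inc.</a></div>
<h2 class="sectionTitle">Education</h2>
<div><strong>MBA</strong> </div>
<div style="padding-bottom:15px;" itemprop="alumniOf">Harvard Business School</div>
</td></tr></table>
"""


def section_lines(html, heading):
    """Collect <strong> and <a> texts between the given <h2> and the next <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("h2", string=heading)
    lines = []
    if header is None:
        return lines
    # find_next_siblings() with no arguments yields only tags, not text nodes.
    for item in header.find_next_siblings():
        if item.name == "h2":
            break  # reached the next section
        if item.strong is not None:
            lines.append(item.strong.get_text(strip=True))
        if item.a is not None:
            lines.append(item.a.get_text(strip=True))
    return lines


board = section_lines(SAMPLE_HTML, "Board Members Memberships")
print("\n".join(board))
```

Scoping by heading means the Education and Other Affiliations sections are never visited, so it does not matter whether their rows contain links.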
Upvotes: 1