Reputation: 371
I have tried in anger to parse the following representative HTML extract, using BeautifulSoup and lxml:
[<p class="fullDetails">
<strong>Abacus Trust Company Limited</strong>
<br/>Sixty Circular Road
<br/>DOUGLAS
<br/>ISLE OF MAN
<br/>IM1 1SA
<br/>
<br/>Tel: 01624 689600
<br/>Fax: 01624 689601
<br/>
<br/>
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail: </span>
<a href="mailto:email@abacusion.com" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">email@abacusion.com</a>
<br/>
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span>
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a>
<br/>
<br/><b>Partners(s) - ICAS members only:</b> S H Fleming, M J MacBain
</p>]
What I want to do:
Extract 'strong' text into company_name
Extract 'br' tags text into company_line_x
Extract 'MainContent_Email' text into company_email
Extract 'MainContent_Web' text into company_web
The problems I was having:
1) I could extract all the text by using .findAll(text=True), but there was a lot of padding in each line
2) Non-ASCII characters are sometimes returned and this would cause csv.writer to fail. I'm not 100% sure how to handle this correctly; I previously just used unicodecsv.writer, roughly as sketched below.
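What I had before (a rough sketch only; the filename and the field variables are placeholders for whatever the parsing produces):
import unicodecsv

# unicodecsv takes care of encoding the unicode strings on write
with open('companies.csv', 'wb') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')
    writer.writerow([company_name] + company_lines + [company_email, company_web])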
Any advice would be MUCH appreciated!
At the moment, my function just receives the page data and isolates the 'p' tag with class 'fullDetails':
def get_company_data(page_data):
    if not page_data:
        pass
    else:
        company_dets = page_data.findAll("p", {"class": "fullDetails"})
        print company_dets
        return company_dets
Upvotes: 2
Views: 795
Reputation: 474221
Here's a complete solution:
from bs4 import BeautifulSoup, NavigableString, Tag

data = """
your html here
"""

soup = BeautifulSoup(data)
p = soup.find('p', class_='fullDetails')

company_name = p.strong.text

company_lines = []
for element in p.strong.next_siblings:
    if isinstance(element, NavigableString):
        text = element.strip()
        if text:
            company_lines.append(text)

company_email = p.find('span', text=lambda x: x.startswith('E-mail:')).find_next_sibling('a').text
company_web = p.find('span', text=lambda x: x.startswith('Web:')).find_next_sibling('a').text

print company_name
print company_lines
print company_email, company_web
Prints:
Abacus Trust Company Limited
[u'Sixty Circular Road', u'DOUGLAS', u'ISLE OF MAN', u'IM1 1SA', u'Tel: 01624 689600', u'Fax: 01624 689601', u'S H Fleming, M J MacBain']
email@abacusion.com http://www.abacusiom.com
Note that to get the company lines we have to iterate over the strong tag's next siblings and collect all of the text nodes. company_email and company_web are retrieved by label, in other words, by the text of the span tags that precede them.
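If you need this for several records, the label lookup can be wrapped in a small helper (get_by_label is just an illustrative name, not a BeautifulSoup method):
def get_by_label(p, label):
    # Find the span whose text starts with the label and return the text
    # of the <a> tag that follows it, or None if either is missing
    span = p.find('span', text=lambda x: x and x.startswith(label))
    if span is None:
        return None
    link = span.find_next_sibling('a')
    return link.text if link is not None else None

company_email = get_by_label(p, 'E-mail:')
company_web = get_by_label(p, 'Web:')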
Upvotes: 3
Reputation: 1824
Same as you have done for the p data, by using findall() (I use lxml for the sample code below).
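Here root is assumed to be the p element itself; one way to get it (page_html is just a placeholder for the full page source) would be:
import lxml.html

# page_html is assumed to hold the full page source that contains the extract
tree = lxml.html.fromstring(page_html)
root = tree.find('.//p[@class="fullDetails"]')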
To get company name:
company_name = ''
for strg in root.findall('strong'):
    company_name = strg.text  # this will give you Abacus Trust Company Limited
To get company lines/details:
company_line_x = ''
lines = []
for b in root.findall('br'):
    if b.tail:
        addr_line = b.tail.strip()
        if addr_line != '':
            lines.append(addr_line)
company_line_x = ', '.join(lines)  # this will give you Sixty Circular Road, DOUGLAS, ISLE OF MAN, IM1 1SA, Tel: 01624 689600, Fax: 01624 689601
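To also pull the email and web fields from the question, one option (still assuming root is the p element as above) is to look the anchors up by the ids that appear in the extract:
# The anchor ids come straight from the HTML extract in the question
email_link = root.find('.//a[@id="ctl00_ctl00_bodycontent_MainContent_linkToEmail"]')
web_link = root.find('.//a[@id="ctl00_ctl00_bodycontent_MainContent_linkToSite"]')
company_email = email_link.text if email_link is not None else ''
company_web = web_link.text if web_link is not None else ''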
Upvotes: 1